DOCUMENT RESUME

ED 146 233                                        TM 006 639

TITLE            Standardized Testing Issues: Teachers' Perspectives.
                 Reference and Resource Series.
INSTITUTION      National Education Association, Washington, D.C.
PUB DATE         77
NOTE
AVAILABLE FROM   National Education Association, 1201 Sixteenth Street
                 N.W., Washington, D.C. 20036 (Stock Number 1501-0-00,
                 $5.75)
EDRS PRICE       MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS      Achievement Tests; Change Strategies; Criterion
                 Referenced Tests; Elementary School Students;
                 Elementary Secondary Education; Evaluation Methods;
                 Minority Groups; *Standardized Tests; Student
                 Reaction; *Student Testing; *Teacher Attitudes;
                 Teacher Responsibility; *Test Bias; *Testing
                 Problems; Testing Programs; Test Interpretation
IDENTIFIERS      *Alternatives to Standardized Testing

ABSTRACT
The problems associated with standardized testing are
illustrated in this collection of articles. Alternatives to current
practices and strategies for change are suggested. The contributors
discuss the roles and responsibilities of groups concerned with
student evaluation systems, the testing of minority group and
non-English-speaking students, problems in using students' test
results for evaluation of teachers, and teachers' perspectives on
testing alternatives. The 1975 report of the National Education
Association (NEA) Task Force on Testing and a report of the 1972 NEA
Conference on Civil and Human Rights in Education are appended.
(Author/B?)



Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.



Standardized Testing Issues

Teachers' Perspectives



Reference & Resource Series




National Education Association 
Washington, D.C. 



Copyright © 1977

National Education Association of the United States
Stock No. 1501-0-00



Note

The opinions expressed in this publication should not be construed as representing the policy
or position of the National Education Association. Materials published as part of the NEA
Reference & Resource Series are intended to be discussion documents for teachers who are
concerned with specialized interests of the profession.

Library of Congress Cataloging in Publication Data

National Education Association of the United
States.
Standardized testing issues.

(Reference and resource series)

Includes bibliographical references.
1. Examinations, United States. I. Title.
II. Series.

LB3051.N3 1977 372.1T6 77-24041
ISBN 0-8106-1501-0



Acknowledgments

"Glossary of Measurement Terms" (in "Guidelines and Cautions for Considering Criterion-
Referenced Testing" by Bernard McKenna) is excerpted from the revised edition of A Glossary
of Measurement Terms: A Basic Vocabulary for Evaluation and Testing, published by CTB/
McGraw-Hill, Del Monte Research Park, Monterey, California 93940. Reprinted by permission
of the publisher.

The following articles are reprinted with permission from Today's Education:

"An Alternative to Blanket Standardized Testing" by Richard J. Stiggins.

"Criticisms of Standardized Testing" by Milton G. Holmen and Richard F. Docter.

"The Looking-Glass World of Testing" by Edwin F. Taylor.

"One Way It Can Be" by Brenda S. Engel.

"A Summary of Alternatives"

"A Teacher Views Criterion-Referenced Tests" by Jean S. Blachford.

"Teacher-Made Tests: An Alternative to Standardized Tests" by Frances Quinto.

"The Testing of Minority Children: A Neo-Piagetian Approach" by Edward A. De Avila
and Barbara Havassy.

"The Way It Is" by Charlotte Darehshori.

"What's Wrong with Standardized Testing?" by Bernard McKenna.






CONTENTS

What's Wrong with Standardized Testing? by Bernard McKenna

The Looking-Glass World of Testing by Edwin F. Taylor

The Way It Is by Charlotte Darehshori

One Way It Can Be by Brenda S. Engel

Roles and Responsibilities of Groups Concerned with Student Evaluation
Systems by Bernard McKenna

Why Should All Those Students Take All Those Tests?

A Teacher Views Criterion-Referenced Tests by Jean S. Blachford

Guidelines and Cautions for Considering Criterion-Referenced Testing
by Bernard McKenna

The Testing of Minority Children: A Neo-Piagetian Approach by Edward A.
De Avila and Barbara Havassy

Criticisms of Standardized Testing by Milton G. Holmen and
Richard F. Docter

Problems in Using Pupil Outcomes for Teacher Evaluation by Robert S.
Soar and Ruth M. Soar

Teacher-Made Tests: An Alternative to Standardized Tests by Frances
Quinto

An Alternative to Blanket Standardized Testing by Richard J. Stiggins

A Summary of Alternatives

Appendices

    Tests and Use of Tests: NEA Conference on Civil and Human Rights
    in Education, 1972

    Report of the NEA Task Force on Testing, 1975

Contributors

Footnotes and References



"Girl number twenty," said Mr. Gradgrind, squarely
pointing with his square forefinger, "I don't know that girl.
Who is that girl?"

"Sissy Jupe, sir," explained number twenty, blushing,
standing up, and curtseying.

"Sissy is not a name," said Mr. Gradgrind. "Don't call
yourself Sissy. Call yourself Cecilia."

"It's father as calls me Sissy, sir," returned the young
girl in a trembling voice, and with another curtsey.

"Then he has no business to do it," said Mr. Gradgrind.
"Tell him he mustn't. Cecilia Jupe. Let me see. What is your
father?"

"He belongs to the horse-riding [the circus], if you
please, sir."

Mr. Gradgrind frowned, and waved off the objectionable
calling with his hand.

"We don't want to know anything about that, here. You
mustn't tell us about that, here. Your father breaks horses,
don't he?"

"If you please, sir, when they can get any to break, they
do break horses in the ring, sir."

"You mustn't tell us about the ring, here. Very well,
then. Describe your father as a horsebreaker. He doctors sick
horses, I dare say?"

"Oh, yes, sir."

"Very well, then. He is a veterinary surgeon, a farrier,
and horsebreaker. Give me your definition of a horse."

(Sissy Jupe thrown into the greatest alarm by this
demand.)

"Girl number twenty unable to define a horse!" said Mr.
Gradgrind, for the general behoof of all the little pitchers.
"Girl number twenty possessed of no facts in reference to
one of the commonest of animals! Some boy's definition of a
horse."

* * * * *

"Bitzer," said Thomas Gradgrind. "Your definition of
a horse."

"Quadruped. Graminivorous. Forty teeth, namely
twenty-four grinders, four eye-teeth, and twelve incisive.
Sheds coat in the spring; in marshy countries, sheds hoofs,
too. Hoofs hard, but requiring to be shod with iron. Age
known by marks in mouth." Thus (and much more) Bitzer.

"Now, girl number twenty," said Mr. Gradgrind, "you
know what a horse is."

From Book the First, "Sowing": Chapter Two,
"Murdering the Innocents," of Hard Times by
Charles Dickens (1854).



WHAT'S WRONG WITH STANDARDIZED TESTING? 

by Bernard McKenna 



In the social sciences, economics is known as
the dismal science. In education the "dismal
science" has to be standardized testing.

• Its history is ominous.

• Much test content is unimportant or
irrelevant.

• The structure and formats of the tests
are confusing and misleading.

• The process of administering the tests is
demeaning, wasteful of time, and
counterproductive.

• The application of statistics that result
from test scores distorts reality.

• It is difficult, if not impossible, to ensure
that test results will be used either to
improve student learning or to help
teachers improve instruction.

The paragraphs that follow develop each of
these points.

Intelligence and achievement testing began in
the United States about the turn of the century
and is closely associated with developments in
France. The story is well known of how the French
minister of public instruction commissioned Alfred
Binet to construct a test to identify students whose
aptitudes were so low that they should be placed in
special schools. Binet soon found himself opposing
those philosophers who supported the idea that
intelligence is a fixed quantity. He said, "We must
protest and react against this brutal pessimism."

But the Americans who were influential in
bringing the Binet test to America, Lewis Terman
of Stanford University and Henry Goddard of the
Vineland Training School in New Jersey, espoused
the "brutal pessimism." Terman's translation
became the widely used Stanford-Binet IQ Test.

The U.S. Public Health Service commissioned
Goddard to administer the Binet test to immigrants
at the receiving station on Ellis Island. The test
results "showed" that 87 percent of Russians, 83
percent of Jews, 80 percent of Hungarians, and 79
percent of Italians were feebleminded. Consequently,
the percentage of aliens deported for
feeblemindedness rose by 350 percent in 1913. A
history to be proud of? A record leading to
enlightenment? For shame!

The next gathering of destructive test data
was during World War I when mental tests were
given en masse to draftees. Analysis of these results
immediately following the war resulted in their
discriminatory use against Blacks, to "demonstrate"
that Blacks had lower IQs than Whites. And so it
goes. Between then and now is a history of further
"refinement" of essentially the same content and
formats, of the misuse and abuse of the same kinds
of IQ tests that so destructively dealt with
immigrants and minority groups in the early 1900's
and during World War I.

The history of standardized achievement
testing is only slightly less dismal than that of IQ
tests. Edward L. Thorndike developed the first
formal achievement tests in 1904. The main reason
for achievement testing was not to assess student
progress or improve teaching but to establish the
profession of psychology as a science separate from
philosophy. Never mind the students and teachers
and their needs. The psychologists saw the
opportunity to be considered scientists if they came up
with precise measuring tools with which to ply
their trade. Thorndike wrote that "the nature of
educational measurement is the same as that of all
scientific measurement." And so the course was
set, a course that has never been reversed: The
evaluation of student progress would be considered
in the same realm with measuring tolerances of
automobile pistons or the trajectory of missiles.

The near panic among the American public
created by Russia's launching of the Sputnik in
1957 led to vastly increased testing programs. This
overemphasis on the use of tests resulted in several
published warnings of the dangers of such
proliferation. Testing, Testing, Testing, by a joint
committee of national educational associations, and
The Tyranny of Testing, by Banesh Hoffmann,
were among them. But these warnings went
unheeded. And before the end of the 60's, evaluation
guidelines of Title I of the Elementary and
Secondary Education Act resulted in even more
standardized testing.

By the early 1970's, the situation had become
so oppressive that warnings were once again
heralded. A national task force of the National
Education Association and two substantive and
penetrating issues of the National Elementary Principal
(March-April 1975, July-August 1975) were among
those sounding the alarm. Even as this article goes
to press, the Reader's Digest carries a warning piece
on the potential dangers of standardized testing,
and a report out of London discredits a British
psychologist's studies of identical twins, a major
source for the conclusion that IQ is innate.

At the same time a movement called
performance-based education calls for more testing,
much of which is or promises to become
standardized in one form or another. One is reminded
of the refrain, "When will they ever learn, when
will they ever learn?"

Ralph Tyler's observation that standardized
tests get "small answers to small questions" is apt.
The content of the tests evaluates little more than
the ability to recall facts, define words, and do
routine calculations. Obviously, not all these things
are unimportant, but even in the reading and
mathematics parts of these tests, many of the
questions are inane. The mathematics sections
emphasize mechanical calculation at a time when
inexpensive electronic calculators are available to
the general public. And the tests make almost no
provision for evaluating a student's ability to
estimate or to measure real things, important skills
needed for functioning as workers and citizens. As
one prominent mathematician has put it, "The
concepts sections of most of the commonly used
achievement tests suffer from the fact that they
trivialize the concepts." Large percentages of the
items in standardized tests, particularly in IQ tests,
are limited to word definitions, all of which are
learnable and tell little about students' general
aptitudes. Further, the words to be defined are
often obscure, infrequently used or encountered in
reading, writing, and speaking.

If the content of the basic skills tests and IQ
tests is poor, that in the social studies is infinitely
worse. For example, the social-studies part of one
nationally prominent test reflects little of
contemporary curriculum change and improvement in
this subject area. One review states that it totally
neglects "the art of discovery" and "process," both
very much a part of accepted teaching strategies
today. And of science test items, a
scientist-researcher says, "They are incorrect, misleading,
skewed in emphasis, and irrelevant."

The content of standardized tests emphasizes
getting "right answers," almost totally neglecting
the thought process by which the answers are
arrived at. Interestingly, a recent Gallup poll
indicates that a major educational concern of
parents is that the schools help students think for
themselves.

Much else that is wrong with the substance of
standardized tests can be only briefly cited here:

• Test content does not reflect local
instructional objectives or specific
curriculums.

• Much of the content is unimportant or
irrelevant to anything students need to
know or understand.

• Test content measures mainly recall-type
learning, neglecting the higher thought
processes: analyzing, synthesizing, and
drawing generalizations and applying
them to new phenomena.

• The tests give an incomplete picture of
student learning progress, because items
that all or almost all students have
learned are removed from the tests in
order to keep the norming procedure
statistically sound.

• The test maker uses a language that is
not commonly used in other activities in
the real world.

• Test items are unduly complex and
require too many different manipulations;
sometimes instructions for the items are
unclear.

• Test vocabularies and illustrations are
often unfamiliar to those who are not of
white middle-class cultures or for whom
English is a second language; that is, the
tests are culturally and linguistically
biased.
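
The point in the list above about items being removed for norming purposes can be illustrated with a minimal sketch. Everything in it is assumed for illustration: the 0/1 response matrix is invented, the function names are hypothetical, and the 0.90 cutoff is an arbitrary stand-in rather than any publisher's actual rule. It computes the proportion of students answering each item correctly and drops the items nearly everyone gets right, which is how material that students have actually learned can vanish from the reported picture.

```python
# Hypothetical response data: one row per student, one column per item.
# 1 = answered correctly, 0 = answered incorrectly.
responses = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]


def proportion_correct(matrix):
    """Return, for each item, the fraction of students who got it right."""
    n_students = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[i] for row in matrix) / n_students for i in range(n_items)]


def retained_items(matrix, cutoff=0.90):
    """Return indexes of items kept after dropping those at or above the cutoff.

    The cutoff is an illustrative assumption, not a documented standard.
    """
    return [i for i, p in enumerate(proportion_correct(matrix)) if p < cutoff]


p_values = proportion_correct(responses)
kept = retained_items(responses)
```

In this invented data set the first item, which every student answered correctly, is the one that disappears, so the remaining items say nothing about that piece of learning.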

The test formats are unimaginative, restrictive
of creative thinking, and confusing. The
multiple-choice mentality that is sometimes referred to in
jest ("A, B, C, or none of the above") is more than
a cliché. Large numbers of items in most
standardized tests are multiple-choice. The assertion
that students become more able to think for
themselves by learning to respond to multiple-choice
items offers a simplistic solution to a complex
problem. In fact, there is some evidence that the
reverse is true. Because each multiple-choice item
must appear somewhat plausible as an answer in
order to minimize guessing, more than one answer
can often be logically assumed to be right. This
works particular hardships on those who think
most creatively and innovatively.

Because of space limitations, test illustrations
and pictures frequently are out of proportion: An
eraser is about the same size as an automobile,
houses are smaller than people, etc.

The need for speed in taking the tests imposes
an artificial structure that is not characteristic of
real-life tasks. Obviously, students need to learn to
work rapidly and accurately. But in the real world,
not much is comparable to answering 40
multiple-choice items in 60 minutes, or whatever.

Standardized testing uses up inordinate
amounts of precious instructional time. Thousands
of hours go into testing that might better be used
in individualizing instruction and planning for
teaching. In terms of cost efficiency, the testing
business runs into hundreds of millions of dollars,
the results of which provide little or no help to
students and teachers.

Testing situations generate fear, imply
mistrust, and generally threaten and demean students.
The emphasis on competition, the pressure of time,
and the measures used to discourage cheating cause
students to have lowered self-concepts and to feel
insecure and mistrusted.

Testing settings are frequently physically 
intolerable: Time periods of testing are too long, 
instructions are blared out on public address sys- 
tems, and large groups of students are herded into 
cafeterias or auditoriums where they work on their 
laps. 

In spite of the evidence, the test makers say
that there isn't much wrong with the content,
structure, and formats of the tests. And while they
admit that there are abuses in reporting,
interpreting, and using the results, they assume little
responsibility for this. They argue that if
administrators and teachers would just interpret and use
the results properly everything would be all right.
Well, everything wouldn't be all right.

Surely practitioners can improve test
interpretation and usage, but proper interpretation and
usage are almost unattainable because of the kind
of substance and formats mentioned in the
preceding paragraphs. It is nearly impossible to
separate content and structure from usage; content
and structure, in large part, determine usage.

Even if it were possible to separate content
and structure from usage, large problems of usage
would still remain. Let us examine some of them.

The standardization process in testing leads to
reporting of results in terms of averages (norms).
This distribution of scores along a range ensures
that half the students will be below average no
matter how well they do. Since there is nothing
beyond subjective judgment to determine how
"good" average (or above or below average) is, it is
possible that "below average" represents good
progress on some tests and "above average" represents
poor progress on others.
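
The arithmetic behind this point can be sketched briefly. The scores below are invented for illustration, the function name is hypothetical, and the group median is used as a stand-in for the "norm" (an assumption; publishers derive norms from much larger samples). The sketch shows only that a norm-referenced report places about half of a group below the norm even after every student's raw score has risen.

```python
from statistics import median


def below_norm(scores):
    """Count students scoring strictly below the group median (the assumed norm)."""
    norm = median(scores)
    return sum(1 for s in scores if s < norm)


# Invented raw scores for a group of eight students.
scores = [52, 61, 67, 70, 74, 78, 83, 90]

# The same students after every one of them gains ten points.
improved = [s + 10 for s in scores]

before = below_norm(scores)
after = below_norm(improved)
```

Both counts come out the same, four of eight, because the norm moves with the group; the raw gain of ten points per student is invisible in a below-average or above-average report.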

On the matter of interpretation of results, a
major fault with standardized testing is attributing
to the findings much more meaning than they
deserve: assuming that verbal and quantitative
scores stand for general intelligence, for example.
Guilford and his associates confirmed long ago that
the intellect has many dimensions, of which verbal
and quantitative abilities are only a part.

The statement of a prominent psychometrist
that if she had just one measure of intelligence it
would be vocabulary represents the kind of narrow
point of view about interpreting test scores that
does disservice to both those who are tested and
those who use scores to make decisions that may
affect human beings throughout their lives. On the
achievement-test side, a student's ability (or
inability) to respond "correctly" to more than half
the items on a standardized achievement test in
biology or social studies tells too little of his or her
potential in either of these subjects to be a basis
for broad-range decision making.

Yet decisions are made regularly on such
narrow data, decisions that may limit or deny
students' opportunities. On the basis of standardized
test results, students are categorized, grouped, and
pigeonholed; placed in classes for the retarded;
excluded from particular courses of study; prohibited
from pursuing advanced programs; barred from
particular institutions; and even denied job
opportunities. And all this, sometimes on as small a basis
as two or three wrong answers, answers to
questions that themselves may be highly questionable.

Even when test results are not used in formal
decision-making processes, they affect
practitioners' expectations of particular students. "Mary
is in the lower quartile. There is not much use
spending time on her; she just doesn't have it," is
an attitude that test results create. But Mary may
"have it," and the reasons for the low test scores
may have been the particular testing situation, or
Mary's physical or emotional situation at testing
time. Or Mary may "have it" in many ways not
evaluated by the tests. But since the tests
themselves create the impression that they measure
what's important or most of what's important,
Mary may not get much attention after scoring low
on them.

Decision making on the basis of standardized
test scores goes far beyond the classroom. School
administrators use test scores in comparing
classrooms and school buildings and make decisions on
programs and personnel accordingly; school boards
and legislatures use scores to determine the
allocation of resources; and the public judges the overall
quality of education on the basis of the scores they
read about in the papers. None of these uses is
appropriate. All of them assume that the tests
indicate much more than any group-administered
standardized test is capable of.

Most important, for students and teachers,
the test results are too broad and general to
provide diagnosis of individual student learning
problems, and they don't help teachers select the most
appropriate teaching methodologies for individual
students or groups of students.

The schools and colleges of America should
not use group-administered norm-referenced
standardized intelligence, aptitude, and achievement
tests. As Jerrold Zacharias, prominent physicist and
professor emeritus at Massachusetts Institute of
Technology, pointed out in the National
Elementary Principal, it is not sufficient to "retreat to
catch phrases like 'I know these tests are not very
good, but they are all we have.' There are many
other ways to assess a child's general competence.
They may not look numerical or scientific, but
they are known to every teacher and every school
principal who reads this journal."

Among such other ways are
objectives-referenced (criterion-referenced) tests of which
teacher-made tests are a part, individual diagnostic
instruments, interviews of students to determine
their progress and learning needs, evaluation of the
products of student work and their live
performances, simulation, contracts with students,
student self-evaluation, and peer evaluation.

Almost no one wants less rigorous evaluation
of student learning progress. If the American
schools are to respond effectively to agreed-on
goals and objectives, more and better evaluation
procedures will be required. But one thing is
certain: Large-scale mass-administered standardized
testing programs will not accomplish this mission.

Most teachers are well aware of this. They
need to use their expertise, professional judgment,
and influence with other educators and the public
to end such testing programs in their school
systems. And individually and collectively, they need
to influence the testing industry, state education
departments, and other groups to reallocate their
large resources to research, develop, field test, and
disseminate a broad range of alternatives to
standardized tests for evaluating student learning
progress and to help teachers improve instruction.






THE LOOKING-GLASS WORLD OF TESTING

by Edwin F. Taylor



Take a look at this multiple-choice question:

Scientists study three basic kinds of things:
animals, vegetables, and

    people
    stars
    minerals
    foods
    religions

"Animal, vegetable, or mineral" is a way to
divide up the world in the game "20 Questions." It
has nothing to do with what scientists study. In
fact, scientists study (among other things) people
and stars and minerals and foods and (if you
include archaeology) religions. The description of
science implied by this test question is nonsense.
No sense.

That question is from a standardized
achievement test for elementary school children. (We'll
mention later the meanings of standardized and
achievement.)

Now look at this question from another test:

If 1/2 of 6 is 3, then 1/4 of 8 is

Never mind the answer (which is also
presented as multiple-choice): What does the question
mean? If . . . then . . . usually means that one thing
follows logically from something else. What is the
logical connection between 1/2 of 6 and 1/4 of 8?
There isn't any. No logic.

Here is a third question from the same page of
the same test as the preceding one:

Different melons weighed 12 lb, 10 lb, 22 lb,
15 lb, and 16 lb. How many pounds did the
middle-sized one weigh?

Before answering the question, think of a
cantaloupe or honeydew melon in a supermarket:
What does it weigh? A small one, 2 or 3 pounds; a
big one, 7 or 8 pounds. The question says "12 lb,
10 lb, 22 lb, 15 lb, and 16 lb." Good grief, they are
all huge! None of them is middle-sized. They are
unreal. No reality.

No sense, no logic, no reality. That is the
impression you get from reading through test after
standardized test. At first you think there must be
some mistake, some one or two test makers who
do a particularly poor job. And some tests are truly
terrible. But all of them I have read are at least
bad.

Test makers clearly live in some sort of
fantasy world. That would be all right by me except
that my children and yours are judged by their
standards. In order to succeed on these important
tests, our children must adopt their crazy logic and
distorted view of reality.

From the outside, the testing business seems
useful, helpful, normal, and impressive. Most
people want to know how well their children are
doing in school and how well their school is doing
in comparison with other schools. Each test has
been tried out with thousands of children
("standardized") so one expects that all the bugs have
been worked out of it.

But the tests themselves are secret, in the
sense that parents and other public groups cannot
examine and discuss them. And as soon as you
look inside the tests, you realize that instead of
being useful, helpful, normal, and impressive, they
are none of the above. One feels like Alice in
Through the Looking Glass, who stepped into the
fantasy world behind the mirror over her fireplace.

Then she began looking about, and noticed
that what could be seen from the old room was
quite common and uninteresting, but that all the
rest was as different as possible. For instance, the
pictures on the wall next the fire seemed to be all
alive, and the very clock on the chimney-piece
(you know you can only see the back of it in the
Looking Glass) had got the face of a little old man,
and grinned at her.

In this chapter we take a very brief stroll
around the looking-glass world of standardized
achievement tests. (Achievement tests examine
what you know or do, as opposed to aptitude or
intelligence tests, which examine, supposedly,
what your potential for learning is.) To keep the
story simple, we will quote examples only from
tests for elementary and junior high school
students (for children up to age 13 or 14).

As you look at one of these test questions, do
not congratulate yourself for knowing the "right"
answer: that is to be trapped behind the looking
glass. Instead, think about the logic and reality of
the question itself, the number of different ways it
can be interpreted by children from a variety of
backgrounds, how many of the given
multiple-choice answers could be correct, and where a child
must look out for a trick, a trap, or a simple
mistake by the test maker.

Two of the questions that began this chapter
are examples of looking-glass arithmetic: the
manipulation of numbers. But numbers themselves
become weirdly distorted in standardized tests. Try
this question:

How many hundreds are in 20 tens?

Never mind the answer itself: What possible use
will the answer have? Does any scientist, doctor,
lawyer, shopkeeper, or homeowner need to know
how to answer this question? The test maker will
mention something about "place value," which
means that children should realize that 20 + 1
equals 21 and not 30. But if children have this kind
of trouble, you help them with that rather than
teach them some jargon.

Apart from its uselessness, the question
contains a linguistic trap. Since 20 tens equal 200,
therefore there are two hundreds in 20 tens. So the
answer is 200, right? Wrong! But never mind.

Here is another question about numbers, in
fact the number zero.

36. Which of these are names for zero?

    I.   0 + 10
    II.  0 x 10
    III. 0 ÷ 10

    A. II only
    B. I and II only
    C. II and III only
    D. I, II and III

First of all, what 3oes "names for zero" 
mean? I know four names for zero: null, coid, zip, 
and zilch. None of them appears among the 
answers, so try again.. Apparently 0 x 10 is a name 
for zero. This name for zero is called Roman nu- 
meral II. Another name for zero is called III. The 
answer is "II and III." This answer is called letter 
C. In order to answer the question the poor child 
has to keep in mind simultaneously all these names 



6 

ERIC 



and names forenames. He' or she may feel like Alice 
when the White Knight explains the names for his 
song: / 

"The name of the song b called 'Haddocks* 
Eyes,''' 

"Oh, that's the name of the song, is it?" AJicc ' 
said, trying to feel interested. 



The Original Looking-Glass Achievement Test

"Can you do Addition?" the White Queen
asked. "What's one and one and one and one and
one and one and one and one and one and one?"

"I don't know," said Alice. "I lost count."

"She ca'n't do Addition," the Red Queen
interrupted. "Can you do Subtraction? Take nine
from eight."

"Nine from eight I ca'n't, you know," Alice
replied readily, "but--"

"She ca'n't do Subtraction," said the White
Queen. "Can you do Division? Divide a loaf by a
knife: what's the answer to that?"

"I suppose--" Alice was beginning, but the
Red Queen answered for her. "Bread and Butter,
of course. Try another Subtraction sum. Take a
bone from a dog: what remains?"

Alice considered. "The bone wouldn't remain,
of course, if I took it, and the dog wouldn't
remain: it would come to bite me, and I'm sure I
shouldn't remain!"

"Then you think nothing would remain?"
said the Red Queen.

"I think that's the answer."

"Wrong, as usual," said the Red Queen. "The
dog's temper would remain."

"But I don't see how--"

"Why, look here!" the Red Queen cried. "The
dog would lose its temper, wouldn't it?"

"Perhaps it would," Alice replied cautiously.

"Then if the dog went away, its temper would
remain!" the Queen exclaimed triumphantly.

Alice said as gravely as she could, "They
might go different ways." But she couldn't help
thinking to herself, "What dreadful nonsense we
are talking!"

"She ca'n't do sums a bit!" the Queens said
together, with great emphasis.






"No, you don't understand," the Knight said,
looking a little vexed. "That's what the name is
called. The name really is 'The Aged, Aged Man.'"

"Then I ought to have said, 'That's what the
song is called'?" Alice corrected herself.

"No, you oughtn't: that's quite another thing!
The song is called 'Ways and Means'; but that's
only what it's called, you know!"

"Well, what is the song, then?" said Alice,
who was by this time completely bewildered.

"I was coming to that," the Knight said. "The
song really is 'A-sitting on a Gate'; and the tune's
my own invention."

Here is an example of what my colleague
Judah Schwartz calls "A is to B as C is to almost
anything":

Pullman was to railway cars what--
Whitney was to oil
Goodyear was to rubber
Jefferson was to cotton
Boston was to beans
don't know

Since there is no unique relationship between
different kinds of things (such as a person and a
product), the item asks, in effect, "What am I
thinking?" The result is to penalize inventiveness.
Boston produced beans just as surely as Pullman
produced railway cars. Tests are full of this kind of
question, particularly the college entrance
examinations.

In no field is the unreality of the test maker's
world more apparent than in science. Here is a
looking-glass question about mirrors:

What does this picture of a boy looking at
himself in a mirror illustrate?
focusing
transparency
dispersion
reflection
[don't know]

This is one of many, many examples of a
multiple-choice problem in which all the choices
are correct. The picture of a boy looking at himself
in a mirror illustrates focusing on his eyes (and
ours!); it illustrates transparency of the glass; it
illustrates color fringes due to different speeds of
light of different wavelengths in the glass (called
dispersion; the original figure is two-color with
blue and black, so is "in color"); and it certainly
illustrates reflection. If I cannot choose one among
these correct answers, will I be given full credit for
choosing the answer "don't know"?

Along with "content," the enterprise of
science itself as pictured in achievement tests is
seriously distorted. One example began this article.
Here is another one:

Which method is used by scientists to discover 
new facts? 

talking and listening
reading and writing
revising and amending
experimenting and observing
What does "facts" mean? Experimental data?
Then clearly "experimenting and observing" is the
correct answer. But experimental data are not
"discovered" as some kind of surprise: They are
recorded as the result of planned experiments.
Maybe "new facts" means "new theories." New
theories can be discovered, but how are they
discovered? Under what circumstances have you
had new ideas? While talking or listening or reading
or writing or revising or amending or experimenting
or observing? Yes! And while dozing or
waking or sitting or walking or bicycling or. . . . In
truth, this question seriously misrepresents the
enterprise of science. In order to answer the question
at all, the child must adopt the fantasy world
of the test maker.

Are all test items as bad as the ones we have
shown? No, but a significant percentage are, the
percentage being greater or smaller depending on
how you set your sights. Banesh Hoffmann, author
of The Tyranny of Testing, has a standing offer for
test publishers: On any standardized achievement
test not concerned merely with trivial facts or
routine arithmetical operations, he guarantees that
reasonable people will agree that at least 10 percent
of the questions are seriously faulty.

He is clearly being conservative. It should not
be difficult to find significant faults with 20 percent
of standardized test items. Indeed, if one is
allowed to object on principle to crowded graphic
layout, a separate answer sheet, or the multiple-choice
format itself, then the failure rate for test
questions themselves can approach 100 percent.
But even if only 10 percent are faulty, this constitutes
a serious indictment of these tests, since a
variation of 10 percent in number of "correct"
answers can oftentimes determine whether a child
is placed at the top or in the middle of his or her
"reference group."

Why are achievement tests so bad? I believe
that the primary reason is the test maker's goal of
lining up children along a single line by asking,
"Who has the higher score?" The inhumane notion
that people can and should be compared with one
another along a line is the fundamental error that
leads to the looking-glass world of testing and its
perversion of our educational system.

This notion also leads to the brainless use of
statistics in the development of tests. The standardized
test is constructed initially by selecting
questions from a large reservoir composed by
"item writers." The preliminary version is then
tried out with different groups of children, each
group large enough to provide "statistically significant"
information on whether or not each test
item discriminates between children in the way
that the test makers wish to discriminate. Typically,
a revision of the test is tried out with a large
selection of children in order to "standardize" the
results for different groups.

The test items that survive this selection
process are those that make the "appropriate"
discriminations between children and not necessarily
those that are logical, correct, or clearly laid out,
or that actually test the skills that society holds to
be important.
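The selection step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which statistic the test makers use, so the sketch assumes the classical upper-lower discrimination index (the proportion answering an item correctly among the highest total scorers minus the proportion among the lowest). The function name, the 27 percent group size, and the sample data are all hypothetical.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index for a single test item.

    item_correct: list of 0/1 flags, one per child, for this item.
    total_scores: list of total test scores, in the same child order.
    fraction:     share of children placed in each extreme group
                  (0.27 is an assumed, conventional choice).
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank children by total score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical tryout data for ten children.
totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
item_kept = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]     # high scorers get it "right"
item_dropped = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to total score

print(discrimination_index(item_kept, totals))
print(discrimination_index(item_dropped, totals))
```

Nothing in the computation asks whether the keyed answer is actually correct or the item clearly laid out; under this assumed procedure an item survives only because of how it separates high scorers from low scorers.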

This entire process of test development can in
principle take place without any child's sitting down
with a sensitive adult to try out the questions and
discuss which of the difficulties are important and
relevant and which are trivial, irrelevant, or caused
by the form or layout of the test itself. Until test
makers get a lot closer to real individual children,
the children who take their tests have the terrible
choice between remaining real (and failing) and
becoming part of the test maker's dream (and
losing their own reality).

If we know how tests come to be as bad as
they are, why do they remain so bad? I believe that
the answer is summarized in one word: secrecy. So
much time and effort go into trying out each test
item with large numbers of children that it
becomes a valuable property in its own right. To
make such an item public is to destroy its power to
compare children with one another. The result is
that parents as a group cannot see the tests by
which their children are judged. Until parents and
teachers can compare notes and seek advice on
tests exposed to the light of day, there will be no
opportunity for their natural outrage to lead to
tests improved in content, humaneness, and
connection to the real world.

What shall we make of all this? Shall we laugh
or shall we cry? In our outrage shall we demand an
end to all achievement testing? Some parents,
teacher organizations, and school boards may
decide so, and their choice should be respected.
Others will continue to feel that children and
teachers need to know how well they are doing and
that schools need to report to parents and other
taxpayers how well children have mastered the
skills that society thinks essential to its proper
operation. In order to do this task humanely, test
development and use must be altered fundamentally.

The first and essential step is to stop comparing
one child with other children (so-called
"norm-referencing") and instead to try determining
whether a child performs the necessary
tasks well enough ("criterion referencing"). The
best example from the adult world is the automobile
driver's test: The driving skills you are
expected to demonstrate are not secret, and you
either do well enough now or you have to try again
later.

Second, test developers must sit down with
children individually, watch them take the test,
and talk with them afterward about which questions
were clear and important and which were
confusing or demeaning. The children who try out
the tests must be from diverse ethnic and cultural
backgrounds, both because tests must not discriminate
on these bases and also because all
children will benefit from the use of tests that are
made understandable for as wide a variety of
children as possible.

Third, the usefulness of tests must be judged
by how soon after completion children and teachers
know which answers are in error and what
misunderstandings may have resulted in a given
incorrect answer.

Fourth, for tests of "practical skills" such as
classifying, describing, narrating, measuring,
estimating, graphing, mapping, and doing word
problems, test makers must show that performance
on the test compares with ability to carry out similar
tasks in settings as near to real life as possible.

Finally, when skills can be clearly related to
test performance, parents, teachers, and administrators
must speak for society by deciding what
level of performance on each test shall be called
"good enough." As a check on this process, tests
must be made public, at least after they have been
given locally.

It's a long road back through the looking
glass, but some of us are starting down it.



THE WAY IT IS 

by Charlotte Darehshori 



One of the main goals of education is to
implement humanistic programs in our schools.
Yet incorporated in these programs as one of the
evaluative tools is one of the most dehumanizing
practices in education: standardized testing.

While most of us talk in terms of individualized
approaches, we employ tests that are
constructed to compare child with child, class with
class, and school with school. We use tests that not
only give us a basis for comparing children, but are
purposefully built to "fail" a certain percentage of
them.

We tell parents not to compare their child
with peers or siblings because this could be
damaging to the child's self-concept; we tell children
not to compare themselves with others. How
then can we justify our practice of using standardized
tests that make just such comparisons?

As a teacher, I have found it harder and
harder to justify standardized testing philosophically,
but it is even more difficult to justify
the cruelty of subjecting young children to the act
of testing itself.

In giving standardized tests we place children
in positions over which they have no control, then
we direct them to perform illogical tasks and to act
as if everything were perfectly logical.

Taking a standardized test is a bizarre experience
for beginning first grade students. Its
scenario comes complete with written parts for
both teacher and student: For the first time in the
children's school careers (perhaps in their lives)
they are interacting with an adult who is reading
from a script that dictates what, how, and when he
or she will react to them. In this play, which is
only too real (its results will follow the children
throughout their school careers and influence the
way some people think of them), all human needs
are put aside when the children and the teacher
step into their roles.

The children have had virtually no practice 
for their role. The teacher, in contrast, carries the 
script around and reads from it word for word. 



Going through this performance begins a
dehumanizing process for student and teacher
alike. Witness this typical testing scene in a first
grade classroom in September 1975.

Meet Melanie, a first grader. She is bright,
somewhat shy, but loves school. She has begun to
make friends, participates in class activities, and
seems to be starting a successful school career.

About the third week of school comes testing
day. Melanie walks into her room, where the desks
have been pushed apart and placed in straight rows
to prevent the children from seeing each other's
papers.

She sits at her desk. The teacher gives each
child a test and a Number 2 pencil and tells the
children to work on their own. If they don't know
an answer, they are to mark the one they think is
best. The test begins.

Melanie has no misgivings about this test. Her
teacher has never placed her in a failing situation,
so she trusts her completely.

The first item on the test has a picture of a
tub. The teacher reads, "Find the letter that has
the sound you hear at the end of tub."

Melanie begins to feel uneasy. The test looks
different from the practice test she had yesterday.
There are lots more funny looking letters and
arrows and bubbles to mark in.

"Find the letter that has the sound you hear
at the end of tub," the teacher says again.

Just as Melanie starts to become frightened,
she sees that one of the letters has already been
marked in.

The teacher reads, "Look at the picture of the
stamp on the other side of the page. This time
listen to the sound you hear at the beginning of
stamp."

Melanie sees a picture of a stamp. Beside it are
three arrows with bubbles. An s-t is beside one of
the bubbles; a c-l, beside another; and a b-l, beside
the last.






Her stomach begins to feel funny; she holds
the pencil more tightly.

The teacher goes on, "You should have
marked the s-t. You hear the sound that s-t makes
at the beginning of stamp. You do not hear c-l or
b-l. You should not have marked these."

The teacher's aide is walking around the room
looking at papers. She comes to Melanie. "Melanie,
do you understand how to mark your answers?"

Melanie looks up at the aide. "I know I'm
supposed to mark in one of these circles. We did
that yesterday, but I can't tell which one to mark."

"Just take a good guess and go on."

"But I don't know. I can't read yet."

The aide pats her on the shoulder. "Just do
the best you can."

The teacher goes on, "We are ready to begin.
If you do not understand what you are to do, raise
your hand. If you are not sure of an answer, mark
the one that you think is right. If you change your
answer, erase the wrong one. If you want me to
repeat any question, raise your hand."

By this time, Melanie and most of the other
children are so confused by the maze of instructions,
they can't even formulate a question.

Since no one raises a hand, the teacher continues,
"First we are going to listen for sounds at
the end of words. Is everyone ready for Number 1?
Look at the picture of the drum. . . . Mark your
answer."

Melanie stares at her paper. She doesn't know
what the teacher is asking her to do. She looks
around, feeling panic. Since many of the children
now have their hands up, she puts hers up. The
aide finishes with one of the other children and
comes to her side. "I don't know what to do,"
Melanie whispers, tears in her eyes. The aide can
only repeat what the teacher has said.

The teacher goes through 21 more items,
including ones in which the children have to be able
to distinguish between e, u, and i as the sound
heard in the middle of first, and u, o, or e as the
sound heard in the middle of rug.

Everyone greets recess with cheers: The children
are exhausted; the aide and the teacher are
exhausted.

After recess, the children come back into class
and see the test booklets still on their desks. They
groan and protest. The teacher gets them settled
down and begins the routine again.



In a situation like the one above, the children
tend to feel that they are failures; they never
suspect that something may be wrong with the
test. The teacher, too, is a victim in this testing
process, because he or she is made to feel that any
problem in carrying out the test is caused by the
way he or she has administered it. According to the
testing manual, "the teacher or examiner who
makes the announcement should guard against
arousing anxiety in the students."
During testing week some children remove
themselves from the intolerable situation by either
"playing sick" or actually becoming sick.
On the second day of testing, Melanie did not
want to come to school, but her mother felt it was
important for her to go and "not get in the habit
of staying home just because something she didn't
like was happening." In this way Melanie's mother,
like many other parents, helped to support the
practice of testing, feeling that it is a necessary evil.
The parent thus joins with the school in further
convincing the child that something is wrong with
the child, not with the test.

Melanie, however, had an asthma attack during
the math portion of this test and got to go
home anyway. During the rest of the year, she was
frequently absent and very reluctant to try new
tasks.

In the second grade, the same pattern continued.
Melanie's experiences with testing seem to
have changed what started as a positive school
experience into a negative one.

Unfortunately, this student is not unique.
Two or three days of testing frequently damage the
self-esteem of many first graders. It is hard to
overstate the negative impact of this test on young
children.

Other children deal with standardized testing
by not really trying, by just marking answers and
going through the motions. On the reading
comprehension part of the test given above, the
children were required to read sentences such as "The
prince took a drink and changed into a frog." Only
two children in this class were able to read at all.

The children were given 15 minutes for this
part of the test. Most went through it marking any
bubble that took their fancy and finished the test
in two or three minutes. Some made nice designs
with the bubbles. Only the two little boys who
could read took more than five minutes for the
test. One of them became frustrated because the
teacher wouldn't help him with a word, so he put
his head down on his desk and refused to finish the
test.

The effects of tests on children are tragic and
cruel. The vicious cycle of labeling and testing follows
children throughout their school experiences,
influencing both teacher and parental attitudes
toward them and, what is worse, their attitudes
toward themselves.

Much has been written about the effects of
testing on teachers' attitudes toward students. We
must now contend with a third party in this
unhealthy situation. Federal and state programs
require increased parent participation, so parents
have access to information which they might not
otherwise be aware of.

Usually parents whose child has low scores
believe either the child or the school is failing.

Teachers who know the limitations of these
scores are reluctant to tell parents a first or second
grade child is ranked "below average." Providing
this information to the parent only perpetuates the
labeling of young children. We must question, however,
any use whatsoever of a score that is so
tainted that we wish to withhold it from a child's
parents. If a score is that misleading and damaging
in its effects, we must examine the wisdom of even
having it available.

We must also question the educational soundness
of writing objectives based on raising scores on
standardized tests. Suppose a school gets government
money for a program to bring all students'
scores that are in the lower two quartiles up to the
upper two. The tests are constructed, however, to
obtain a certain distribution of scores among all
four quartiles. The two lower quartiles will by
definition always contain a certain proportion of
students' scores, so the programs are destined to
fall short of their objectives.
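The arithmetic behind this point can be shown with a short Python sketch. It rests on a simplifying assumption that goes beyond the text: the quartile cut points are recomputed from the same group of children being measured, and every child gains the same number of points. Published tests use fixed norm tables, so a single school can move relative to them, but the norm group as a whole cannot. The function name and the scores are hypothetical.

```python
from statistics import median


def share_in_lower_half(scores):
    """Fraction of scores that fall below the group's own median,
    i.e. in the two lower quartiles of that same group."""
    cut = median(scores)
    return sum(1 for s in scores if s < cut) / len(scores)


# Hypothetical raw scores before a program, and after every child gains 10 points.
before = [31, 35, 38, 40, 44, 47, 52, 58]
after = [s + 10 for s in before]

print(share_in_lower_half(before))
print(share_in_lower_half(after))
```

Every child in the second list knows more than before, yet half of them are still in the lower two quartiles, because the quartiles are defined by rank rather than by what the children can do.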

It is difficult for a teacher to have worked
hard all year only to get the results of standardized
tests and find that, technically, the teacher and
class both have failed. This year nearly every child
in our school is in the lower two quartiles in reading
or math or both. Since the main goal of the
program at our school is to bring these children
into the two top quartiles, the program has failed.

The teachers and staff of our school can
accept this failure intellectually because we feel it
is only a "paper failure." Emotionally, however,
we become frustrated when faced with a list of
scores that says our students are failing
academically.

These tests also negatively affect the programs
that they evaluate. In schools where the staff is
professional and secure, the influence of these tests
is minimal as the staff tries to keep the children's
needs in mind and teach to these needs, not to the
tests. Even so, the need to compare skills achievement
with that in other schools gives the tests
influence. Because evaluation techniques and
standardized tests have not kept pace with curriculum
development and theories of child development,
that influence is regressive.

In other schools, the situation is worse. At
one school where I taught, great emphasis was
placed on the test results. Predictably, teachers did
everything possible to improve the test scores.
Since the only two areas evaluated on the test were
math and reading, teachers concentrated on these
two areas almost to the exclusion of art, social
studies, and music. Recess and lunch time were cut
down in order to give more instructional time in
math and reading. Testing was manipulated to
make the pretest scores lower than the posttest
scores. For pretesting, all the tests were given in
one day, on a Monday; for posttesting, they were
given at a more leisurely pace on Tuesday, Wednesday,
and Thursday, days when the children were
usually more settled.

Tests and work sheets covering the material
on the test finally came to be the curriculum at the
school. The pressure to look good on tests brought
about wide fluctuations in students' test scores:
gains of two or three years one year and regression
the next.

It seems, then, that little of value is derived
from these tests, other than using the scores as
criteria for deciding which schools will get federal
money for new programs. (Why not throw darts?)

To student, teacher, and parent, the tests are
equally devastating. One teacher at William Penn
Elementary School (Bakersfield, California) put it
very succinctly, "How do standardized tests help
me in the classroom? Well, they helped three children
ruin their pants and one child have an asthma
attack."

Teachers have talked about the damaging
effects of standardized tests for years. Perhaps, if
they refused to give the tests, changes and reforms
would result.



One immediate change could be to exclude
children from standardized testing until they
actually have the skills that these tests are
supposed to be testing. Teachers could use their
judgment to decide who should take the tests.

Because these tests are not diagnostic and are
supposed to be more valid (although this, too, is
questionable) for a group than for an individual,
test results should not be linked with an individual
student but only with a group.

On a long-term basis, test manufacturers
should design tests based on the developmental
levels of young children, not adults. In curriculum
we realized years ago that the child is a unique
kind of being and not just a smaller version of a
grown-up. Merely updating the old model of the
standardized test as testing companies have done in
the past and continue to do is not enough.



ONE WAY IT CAN BE 

by Brenda S. Engel 




Until the spring of 1976, the Cambridge
Alternative Public School, then in its fourth year,
had generally avoided administering the standardized
tests ordinarily required of Cambridge
public schools. At that time, however, pressure
from the school department was increasing; the
assistant superintendent for elementary education
felt that he needed concrete evidence of the
quality of the education offered at the school.

With parent support, the school had taken an
antitesting position (similar, on several points, to
that taken by the NEA). The school felt that
scheduling standardized tests disrupted the educational
process, that the tests made many children
anxious, that the tests penalized minority children,
and that their influence on teaching and the
curriculum could be disastrous for an innovative
school. But the school community (teachers,
administration, and parents) also had a strong
interest in carrying out some form of evaluation in
order to corroborate their confidence in the
school. So much for the situation.

At this point, at the request of a parent-staff
committee, I was employed as an independent
consultant to try out some means of evaluation
that might be satisfactory to both the school
community and the school department.

We settled on the third grade for the alternative
evaluation, because it was a well-balanced
class in regard to age, sex, and race. Four teachers
would be directly involved since the 29 children in
the grade were fairly evenly divided among four
classrooms (each contained mixed ages). Most of
the children involved had been in the school from
its inception.

In order to keep the size of the undertaking
manageable during the first experimental year, we
identified three areas of the curriculum for
assessment (math, reading, and art) and proceeded to
make an overall plan, to outline an implementation
schedule, and to design the actual instruments of
evaluation.

The evaluation was to be carried out over a
five-week period toward the end of the school
year. We hoped the instruments of evaluation
would do the following:

• Give each child various ways to demonstrate
his or her abilities.

• Take into consideration the varied
economic, cultural, and linguistic backgrounds
of the children.

• Elicit original responses and creative
thinking.

• Assess significant aspects of education.

• Gain information about children's learning
as directly as possible.

We also hoped that the evaluation would
cause a minimum of disruption in the school and
that it would not be a negative experience for the
children. The actual work of the assessment was to
be shared among a number of people with different
jobs in the school or with different relationships to
it.

When the evaluation was completed, we
hoped to present a report in clear, readable, and
usable form. We planned to make it more descriptive
than judgmental, both noncomparative and
nonnumerical, and useful to teachers as well as
informative to administrators and parents.

The matrix shown in Figure 1 describes what
we intended to assess and how we intended to do
it. The Areas to Be Assessed are listed across the
top and the Means of Assessment down the left-hand
side. Teacher statements led the list of Means
of Assessment. Each teacher gave opinions (which
we determined through lengthy interviews) of each
child's progress in each area of learning: his or her
understanding of the decimal system, sense of
estimation and of probability, and so on across the
matrix, ending with the child's ability to solve
original problems.

The teacher statements began the evaluation
process and supplied the guidelines, in both content
and approach, for conducting the rest of the
assessment. The teachers' opinions of each child's
ability in each curriculum area set the stage for
what followed, particularly for our observations of,
and interviews with, the children.

The second Means of Assessment, classroom
observations, was necessarily open-ended and
directed more toward quality of work and involvement
than toward skills. An observer spent about
half a day in each of the four classrooms, focusing
particularly on the third graders and recording the
observations in anecdotal form.

A parent committee drew up and distributed
parent questionnaires, the third item in the Means
column. The areas checked on the matrix represent
only some of the subjects covered in the questionnaires;
other subjects were matters of general
interest to the school and were not part of the
assessment.

The oral and written tests were made up
specifically for the occasion (i.e., nonstandardized).
They were to inventory the children's
abilities in the specified areas as simply as possible.

Following is an example from such a test. It
was designed to measure children's ability to
estimate as part of the mathematical skills assessed
on the matrix.

About how much does your teacher weigh? 

Which do you think weighs more, a bicycle or
a horse?

About how long is your thumb?

About how high is the ceiling in this room?

About how long does it take you to brush
your teeth?

About how long will it be before you are a
grown-up?

About how long is summer vacation?

About how many children are there in this
school?

About how many pieces of bread are there in 
a loaf? 

Classroom teachers, with the help of graduate
students, gathered the next items on the list:
collection of work samples (current), previous test
results, and summaries of school records.

Finally, when all these data were assembled in
folders, we conducted an interview with each child
to fill in any gaps in the information and clear up
possible ambiguities or contradictions.

Most important, each area of learning was
examined in a variety of ways designed to cross-check
each other. No judgments were made on the
basis of a single means or single occasion.

Another central and challenging consideration
was the form of the final report, which had to be
designed for the requirements of widely different
constituencies: the school department, school
administration, teachers, parents, and children.

The school department was interested in a
concise statement focused mainly on skills. Parents,
although they varied in their expectations of
the evaluations (some looking for cognitive; others,
affective assessment), were primarily interested in
detailed reports on their own children. Teachers
were looking for confirmation of their own
perceptions, for further insights, and for implications for
the curriculum. The school administration shared
all these interests. The children themselves, if they
were at all aware of the nature of the process, were
looking for personal reassurance.

Our reporting system had three parts: a
summary sheet for each child, a key (with school
department expectations, not norms, underlined),
and documentation (folders containing results of,
or notes on, all the Means of Assessment). By
glancing at only the summary sheet, one could gain
a general impression of achievement; one could
read the summary sheet in detail, using the key to
identify specific skills; or one could scrutinize the
actual evidence on which the summary sheet was
based.

It would be misleading to suggest that all of
this does not add up to a substantial amount of
work. Having gone through this process once,
however, those of us involved now believe that the
same approach could be carried out in a variety of
ways and over differing lengths of time.

A teacher, or group of teachers, could custom
design his or her own matrix, listing the subject
matter to be assessed across the top (as shown in
Figure 1) and the feasible means of assessment down
the left-hand column. The matrix itself, once it has
been filled out, can provide the framework for the
assessment process. For instance, one might limit
the means to teacher statements, parent
questionnaires, written tests, and work samples. Similarly,
one could limit the areas to be scrutinized. (It is
important, however, to assess each area in at least
three ways.) Later, then, when the time comes to
write a test or plan a questionnaire, one has only to
look across the horizontal row from a particular
means to identify its content. Classroom teachers
can formulate the specific questions to be asked
without much difficulty.
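
The matrix described above can be pictured as a small table. The following is only a sketch under stated assumptions: the three subject areas and the assignment of areas to each means are invented for illustration and are not taken from the project's own matrix. It shows the two uses the text mentions, reading across a row to find the content for one means, and checking that every area is assessed in at least three ways.

```python
# Hypothetical assessment matrix. Each row is a means of assessment;
# the set lists the subject areas (columns) that the means covers.
# The area names and the coverage pattern are illustrative assumptions.
AREAS = ["reading", "mathematics", "writing"]

matrix = {
    "teacher statements": {"reading", "mathematics", "writing"},
    "parent questionnaires": {"reading", "writing"},
    "written tests": {"reading", "mathematics"},
    "work samples": {"mathematics", "writing"},
}


def content_for(means):
    """Read across one row: the areas a given means must cover."""
    return [area for area in AREAS if area in matrix[means]]


def means_for(area):
    """Read down one column: the means that assess a given area."""
    return [means for means, areas in matrix.items() if area in areas]


def under_assessed(minimum=3):
    """Areas assessed in fewer than `minimum` ways."""
    return [area for area in AREAS if len(means_for(area)) < minimum]


print(content_for("written tests"))  # areas a written test should include
print(under_assessed())              # empty when every area has 3+ means
```

A teacher planning a written test would call `content_for("written tests")`; an empty result from `under_assessed()` indicates that the three-ways guideline is met for the matrix as filled in.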

After the scheduled information has been
collected for each child, the teacher can fill in a
report for each child, viewing the assessment as a
more-than-adequate substitute for the usual reports
and tests, not as an addition to them.

How do we assess the assessment in relation to
our expectations? At this point, it is important to
reemphasize that the purpose of this alternative
evaluation was to compile as informative and as
comprehensive a picture of each child's abilities
and skills as possible, not to compare children's
achievements or rates of growth with those of
other children. In this context, the findings have
promise.

The children, by their own accounts during
their interviews, seemed to enjoy the process,
which was neither seriously interruptive nor
damaging. With the additional specific information
about each student gained from the assessment,
teachers felt that they could do a better job of
individualizing the educational program for each
child. Perhaps most important, the assessment itself
did not violate the educational climate we were
trying to protect and to which we were committed.






ROLES AND RESPONSIBILITIES OF GROUPS CONCERNED WITH STUDENT EVALUATION SYSTEMS 

by Bernard McKenna



The roles and responsibilities delineated
below for specific groups of persons particularly
concerned with student evaluation are based on
findings of and positions taken by the NEA Task
Force on Testing. (The report of this task force is
contained on pages 81-90.) These recommended
roles and responsibilities are considered essentials
for achieving the following goals:

• Sound and fair development of evaluation
systems*

• Appropriate distribution and administration
of evaluation systems

• Accurate and fair interpretation of the
results

• Relevant and constructive action programs
based on the results.


A. Teachers, Individually or Collectively
Through Their Associations, as Appropriate,
Should Do the Following:

1. Seek representation on school district,
testing industry, and government (state
and federal) decision-making groups for
test development (e.g., Educational Testing
Service, National Institute of Education),
become involved in item analysis
and selection, and provide feedback on
content and format of tests.

2. Plan and negotiate for, or otherwise
reach agreement with the school administration
on, released time and district
in-service education programs to prepare
members in the use of tests.

3. Plan professional activities in the area of
testing for all members of the association.

4. Seek and participate in in-service training
in the area of testing to learn to construct
and evaluate teacher-made tests,
to learn about objective- or criterion-
referencing, to learn about alternative
assessment tools, to learn appropriate
reporting procedures, to develop an awareness
of the variety of tests and their purposes,
to keep abreast of latest research
findings, and to develop the ability to
analyze and criticize standardized tests
as they relate to school and district programs
and goals.

5. Work to influence test makers and the
local and state school systems and secure
from them a firm commitment to evaluation
programs that will lead to the
improvement of instruction.

6. Keep parents and other interested community
groups informed about trends
and promising developments in evaluation
procedures and about unsound testing
practices.

7. Negotiate for, or otherwise reach agreement
with the school administration on,
provisions guaranteeing teacher lead time
for preparation for testing, appropriate
testing conditions and scheduling, and
follow-up time for scoring. Provisions
should spell out teachers' appropriate
role in the test-scoring process, e.g., to
remedy the inordinate amount of time
spent on hand scoring.

8. Thoroughly familiarize themselves with
tests to be given (assuming they have
been furnished with appropriate background
materials and sufficient time for
learning about administration of the
instruments).

9. Develop an understanding of their students'
cultural and socio-economic backgrounds
and sensitivity to their individual
needs and problems in order to
avoid the possibility of irrelevant and
biased testing.

10. Periodically review tests to determine
their relevancy to instructional goals and
objectives and their timeliness, and



*The term evaluation systems is used instead
of tests because it is believed that a wide variety of
alternatives to tests should and can be developed
through research and tryout leading to their
validation for evaluation purposes.






recommend to the school administration
and the testing industry abandonment of
irrelevant and outmoded tests.

11. Secure by appropriate means, from the
school or school district administration
as deemed necessary, the right to determine
what tests will be administered,
when they will be given, and at what
intervals. They should also secure the
right to determine exemptions from testing.

12. Secure by appropriate means, from the
school or school district administration
as deemed necessary, the right to determine
proper physical arrangements and
time frames for testing as appropriate for
themselves and their students. Time
allowed should be sufficient for
thorough orientation of students to the
test being given, and for scoring and
reporting results.

13. Be responsible for providing a non- 
threatening attitudinal atmosphere for 
students during testing sessions, given 
the proper conditions. 

14. Assure that machine-scored results are
validated by hand scoring a sample of
tests.

15. Take an objective approach in interpreting
test results, never using them as a
weapon against students.

16. Seek to ensure that test results are not
used to categorize students into homogeneous
groups or as a criterion for student
admission to programs of their
choice.

17. Strive for accuracy in interpreting test
results, relating them to socio-economic
factors affecting individual students.

18. Have respect for student privacy in interpreting
test results and manifest that
respect by working to secure school district
policies guaranteeing students'
privacy in the reporting and dissemination
of test results, which should not be
for public information.

19. Urge strict enforcement of the federal
Privacy Act affecting pupil records.

20. Work to secure legislation which will prevent
publication of test scores.

21. Work to secure legislation which will prevent
the use of test results as a basis for
allocation of local, state, or federal
educational funding.

22. Assure that test results are not compared
among classrooms or buildings or with
other districts or regions.

23. Report on test results in a manner appropriate
to a varied audience: students,
parents, media, professionals.

24. Recommend general and specific program
improvements to the school and
school district administrations, and to
effect the improvements, identify the
needed resources and remedial measures
and programs.

25. Secure through the appropriate means,
from the school or school district administration
as deemed necessary, the stipulation
that test results will not be used in
evaluating teacher performance. (Teachers
should be held accountable for conducting
the best instructional process
possible under existing conditions, not
for guaranteeing learning.)

26. Take a position in favor of the inclusion
of courses in tests and measurements in
all teacher preparation programs, and
provide input on testing problems and
issues to their representatives on professional
governance boards or commissions
to help in the formulation of standards
and requirements for teacher education
and licensure. (There is little evidence
that most preparing institutions or states
specifically require or encourage classroom
teachers to acquire the knowledge
and skills necessary for using tests.)



B. Other Professional Associations Should Do
the Following:

1. Search out and synthesize information
on all issues associated with the development,
use, and abuse of tests and communicate
to the members any information
affecting them or their students.

2. Organize study committees of members
knowledgeable in testing to develop
policies, guidelines, and procedures for
testing. Such committees should seek input
from all members and consultation
from experts in the field.

3. Serve in a "watchdog" capacity on the
introduction and administration of curriculum-related
tests to assure their
appropriateness for schools, and communicate
regional concerns to the testing
industry.

4. Pursue needed changes in school curriculum
programs as identified through
the results of testing, this in cooperation
with other associations in the region
which represent comparable educational
and socioeconomic conditions.

5. Identify alternatives to standardized testing.

6. Provide background information and
regional concerns to those responsible
for drafting or introducing state legislation,
and work for passage of legislation
to regulate types of tests and uses of the
results. These efforts should include calling
for the testing of students in their
dominant language (except, for example,
proficiency tests in English).

7. Urge strict enforcement of the federal
Privacy Act affecting pupil records.

C. Students, Individually or Collectively as
Appropriate, Should Do the Following:

1. Seek a role in the development of tests
through representation on school, district,
and testing industry committees
and by providing feedback on test content
and format.

2. Take positions against the use of measurement
instruments that they feel are
biased and will lead to unfair results on
the basis of race, sex, socioeconomic
status, language, or culture, and make
these positions known to the school and
school district administration and the
testing industry.

3. Make every effort, assuming they have
been afforded proper orientation, to
thoroughly understand the purpose, and
intended uses of results, of any test to be
administered in which they will be involved.
Students should have the right to
refuse to take a test known to be racially,
culturally, or otherwise biased.

4. Seek a role in determining the conditions
of test administration, including scheduling,
preparation, length, location,
facilities. (Many tests are administered
under adverse conditions, with little
attention given to the total physical
environment and insufficient time allowed
for orientation.)

5. Call attention to any physical or attitudinal
pressures in the administration of
tests which they feel threaten them or
their performance.

6. Insist that they be given a thorough explanation
of test results in a meaningful
way and in language they can understand.

7. Take a position on the use of test results,
demanding guarantees of privacy and the
right to determine to whom the results
will be released, insisting that results not
be used to demean or categorize them or
to deny them admission to programs of
their choice, and urging strict enforcement
of the federal Privacy Act, which
affects pupil records.

8. Seek a role in deciding on alternatives
for meeting student needs as identified
through the results of testing, insist on
the right to choose from among alternatives,
and become involved in the
planning of remedial programs.

Local Student Action for Education and Student
NEA groups might assume the leadership role
in involving all students in the evaluation programs
of the school or school district and serve as the
voice of student opinion and the vehicle for their
protection against the adverse effects of evaluation.

D. Minority Groups Should Do the Following:

1. Actively seek representative involvement
on testing industry and school system
decision-making groups for test development
and use.

2. Urge test makers to (a) revise tests in
consideration of minority differences,
eliminating culture-related items from
current tests and working toward cross-cultural
instruments, (b) research ethnic
and regional test requirements and withdraw
tests found to be inappropriate to
the population being tested, and (c)
explore and recommend alternative
forms of student evaluation.

3. Request from the testing industry documentation
on norming procedures and
population bases for norming.

4. Keep members informed of improper 
test procedures and seek support or legal 
assistance where tests and results are mis- 
used. 

5. Urge minority students to refuse to take
tests which are found to be biased and
urge minority teachers to refuse to
administer such tests.

6. Work to prevent the invasion of student
privacy in interpretation and use of test
results.

7. Work for legislation to prevent publication
of test scores and for enforcement
of the federal Privacy Act affecting pupil
records.

8. Promote legislation to prevent the use of
test scores as a basis for allocation of
local, state, or federal educational funding.

9. Take strong positions and action against
the use of test results for tracking, to
denigrate minority intelligence, or to
deny students entrance to programs.

10. Expose the erroneous contentions of
Shockley and Jensen that some groups in
society are genetically less intelligent
than others. (The typical group test is
considered (a) an unreliable measure of
mental ability and (b) to be biased
against minorities, having been standardized
on a different kind of population.)

11. Actively seek changes in curriculum (including
textbooks) to reflect minority
concerns and diagnostic services based
on student needs as identified by appropriate
testing.

12. Seek community support and funds for 
appropriate new or experimental educa- 
tion programs based on needs identified 

through means other than testing. 


13. Become involved in planning and providing
pre- and in-service education for
teachers to orient them to minority
problems and needs related to testing.

14. Seek public awareness of and concern
for minority problems in testing, and
pressure community media to help keep
the public informed, especially on issues
related to proper interpretation and use
of test results.

15. Form coalitions for action in the 
development and use of tests. 

E. The Testing Industry Should Do the Follow- 
ing: 

1. Include in test development substantial
numbers of persons from all groups that
have an interest in and knowledge about
testing, particularly representatives of
classroom teachers and minority groups.

2. Be responsible for producing culturally
fair and bias-free tests that contain relevant
items.

3. Work with all concerned groups in constantly
monitoring, updating, and revising
their tests. The industry should
immediately withdraw out-of-date tests
from the market, as recommended by
those who use them.

4. Take regional diversities into consideration
in constructing tests to ensure relevance
of test items.

5. Correlate tests to current and developing
curricula.

6. Improve sampling techniques and broaden
sampling bases.

7. Undertake in-depth research and
development to perfect a wide variety of
alternatives to standardized norm-referenced
tests.

8. Provide with each test copy a cover
document specifying what the test is
designed for (to reveal depth of subject
knowledge, to verify reading comprehension,
to establish equivalency, etc.) and
what groups (e.g., "early childhood,"
"later elementary") it is appropriate for.
The document should also include a release
form for student signature testifying
that "I understand the purpose of
the test . . ." or "I am taking the test under
protest."

9. Provide an up-to-date manual with each
standardized test, issued in English and
other appropriate language editions
depending on the student population.
The manual should give clear and complete
information for administration of
the test, including proper physical arrangements;
define proper and improper
uses of the test, warning particularly
against using the test for purposes of
teacher evaluation; explain various ways
of interpreting results, providing information
on the basis of norming to ensure
proper interpretation and including a
"Surgeon General's warning" on the
dangers of misinterpretation; delineate
limitations of the test.

10. Provide with each test, not just benchmarks,
but a range of scoring norms.

11. Constantly monitor the distribution of
standardized tests to ensure proper use,
respond promptly to charges of misuse,
and refuse to sell tests or report scores
where misuse is evident.

12. Provide in-service training for teachers
and administrators in the use of standardized
tests; provide consultants and
test administrators to assist teachers in
giving tests and developing sensitivity to
testing conditions; and have representatives
available as resource persons for interpretation
of test results.

13. Provide information on the use of standardized
tests and interpretation of results
to schools of education and urge
them to include courses in tests and
measurements in their required professional
preparation for teachers. Such
courses should include instruction on
limitations of tests, potential bias, and a
broad range of alternatives to testing.

14. Develop recommendations for curriculum
revisions as related to test results in
order to help teachers in planning remedial
programs for students.

15. Establish an extensive PR program to
keep the public informed on testing
issues and developments, issuing information
materials in English and other
language editions.

F. School Administrators Should Do the
Following:

1. Ensure that, when appropriate, all tests
to be administered reflect the uniqueness
of the geographic region in which they
are administered and that locally
developed and standardized tests reflect
updated curriculum.

2. Involve teachers, students, and parents in
decision making related to the testing
program.

3. Ensure that all teachers who must administer
tests are provided with adequate
supplies for the students, proper physical
arrangements, and thorough orientation
time, including practice testing.

4. Provide released time for teachers for in-service
training in the administration of
tests.

5. Ensure that test results are not used to
label students, that the confidentiality of
scores is protected in a professional
manner, and that the federal Privacy Act
affecting pupil records is enforced in
school buildings and districts.

6. Make available to teachers or specialists
tools for diagnostic purposes and training
in their use.

7. Keep parents informed about test results
(using nontechnical language) and keep
the school board informed about the
limitations and possible misuses of tests.

8. Continually evaluate the total testing 
program. 

G. Appropriate College and University Personnel
Should Do the Following:

1. Serve a research function, providing to
NEA and other concerned groups and to
faculty in the school of education their
findings on the use and misuses of standardized
tests (including their own testing
devices), test bias, and alternatives.

2. Serve in a consultative capacity to the
testing industry, providing information
on student population and needs, new
curricula, college admission policies,
scholarships, equal opportunity programs, and
the like.

3. Serve in a consultative capacity to school
systems for in-service teacher education
and for decision making about curriculum
changes based on the results of testing.

4. Seek the involvement of practitioners in
decision making relating to professional
preparation in tests and measurements.

5. Monitor test results from school districts
in their region in relation to new directions
for open admissions, equal opportunity
programs, scholarships, etc., and
keep junior and senior high schools informed
about the relationship of test
scores to admission policies and program
choice.

6. Form coalitions to influence legislation
and provide expert testimony on the
proper uses of tests and test results.



H. Government Agencies

1. The U.S. Congress should legislate restraints
on the use of tests that prevent
equal educational opportunity.

2. The appropriate federal agencies should:

• Provide quality control of testing
by taking steps to restrain the testing
industry from publishing tests
that are improperly constructed
and by monitoring instruments to
ensure their constant updating.

• Provide technical assistance and information
to educators and the
public regarding test development
and use.

• Increase research efforts in standardized
tests and alternatives.

• Assure that teachers are involved in
decision making about the use of
revenue-sharing funds as they apply
to the school system's testing program.

3. State education agencies should:

• Provide consultant services, financial
assistance, and models for
quality in-service education for
teachers on the proper administration
of tests and on limitations of
test results.

• Provide for alternatives to standardized
tests for state assessment
programs.

• Prevent the improper distribution
and administration of large-scale
assessment program materials by
instituting sampling procedures as
opposed to blanket testing.

4. Local education agencies should:

• Provide released time and quality
in-service education for teachers
and other school personnel on the
administration of tests and use of
results.

• Prevent misuse of large-scale assessment
instruments by instituting
sampling procedures as opposed to
blanket testing.

5. Education agencies at all levels should:

• Involve teachers in decision making
on test development.

• Provide the funds for innovative
programs to develop alternatives to
standardized testing and interpretation.

• Provide the funds for long-range
experimental testing programs.






WHY SHOULD ALL THOSE STUDENTS TAKE ALL THOSE TESTS? 



The NEA Task Force on Testing, in its first
interim report, states:

The Task Force believes there is overkill in the use of
standardized tests, and that the intended purposes of
testing can be accomplished through less use of standardized
tests, through sampling techniques where
tests are used, and through a variety of alternatives to
tests. . . .

Representatives of the testing industry and others
told the Task Force that sampling of student populations
could be as effective as the blanket application
of tests that is now so common. Some suggested that
such procedures, in addition to increasing the assurance
of privacy rights, would conserve time, effort,
and financial expenditure.

The blanket use of tests (every-pupil testing) 
in some state assessnient and local testing programs 
appears to require inordinate amounts of time and 
resources on the part of teachers, other personnel 
involved in test administration and interpretation, 
and the students themselves.

Criticisms of the blanket use of tests have
come from a variety of prominent researchers,
evaluators, and other educators.

House, Rivers, and Stufflebeam, in their
evaluation of the Michigan accountability system,
concurred that in that state:

Statewide testing as presently executed also raises the
question of the feasibility of every pupil testing. This
practice appears to be of dubious value when the cost
of such an undertaking is compared with the resulting
benefits to local level personnel. . . . The local, and
hence overall, costs could be reduced by a matrix
sampling plan which requires that each student tested
take only a few items. . . . In the long run, a matrix
sampling plan will be the only one feasible from a
cost and time standpoint. The cost and time required
for every pupil testing for the whole state would be
horrendous. . . . We feel that it [strict adherence to a
statewide testing model] will result in useless expenditures
of monies and manpower, in addition to producing
unwarranted disruptions of the educational
programs within a great number of schools.

In a paper entitled "Criteria for Evaluating
State Education Accountability Systems," the National
Education Association has laid down fifteen
basic principles, one of which is as follows:

If the state desires test data for its own planning purposes,
it should use proven matrix sampling techniques
which will not reveal schools and which will
greatly reduce costs.

Matrix sampling techniques can give an accurate
picture of the state by various categories much more
efficiently than testing each child with an entire
instrument.

It was with such admonitions as these in mind
that this chapter was written. And while some
procedures are appropriate for evaluating all students
in one way or another for particular purposes,
it would appear that there is gross over-use
of blanket testing procedures.

To help teachers and other educators better
understand some main considerations related to
sampling, the NEA obtained permission from Dr.
Frank Womer, Michigan School Testing Service,
University of Michigan, to reproduce material from
a monograph of his on developing assessment programs.
In addition, Dr. Womer prepared, especially
for this paper, a section on item sampling.
Dr. Womer's recommendations follow the excerpts
from his monograph.

* * * * * *

Determining Whether Sampling Is To Be Used 

The decision whether to test an entire population
or use a sample involves a combination of concerns.
Clearly there are policy considerations; clearly
there are psychometric considerations; clearly
there are data collection considerations; and clearly
there are cost considerations. The best possible
staff and consultant thinking on this question
should be brought to an advisory committee for
them to consider very carefully.

Probably the most crucial consideration is a
policy one, since psychometrics, data collection,
and cost generally would argue on the side of
sampling rather than using an entire population.
If it is deemed wise for policy reasons to test all
students in a population, that preference typically
will have to be weighed against available resources
and technology; so we will consider first the policy
implications of the two choices.

One needs to look carefully at the purposes
and goals of a specific assessment program in determining
whether sampling is appropriate. If all of
the specific purposes and objectives of an assessment
program can be met by group results, then
sampling must be considered.

The only assessment situation that clearly
calls for common data collection on all members of
the population is when it is deemed essential, for
improved decision making, to have exactly the
same test information for every pupil in a given
grade in a state (or other assessment unit). It is
exactly this situation that has prevailed for years in
local school districts that have every-pupil achievement
or ability testing at some grade level. Historically,
the compulsory state testing programs
were examples of this situation; the voluntary programs
were not. If a state mandates common testing
for all students it is taking over a role that local
districts traditionally have held. This may be good
or this may be bad depending on one's point of
view of the role of a state department of education.
It certainly has important policy implications.

There are many facets to this point, but it
should be kept clearly in mind that it is not necessary
to test every pupil at a given grade level on
identical material in order to get a good picture of
education outcomes of groups of students; it is
necessary only if one feels that each teacher in an
entire state at a given grade level must have the
same information for each pupil.

Probably the greatest advantage of sampling is
that for a given amount of effort (and money) one
can gather more usable information than by using
an entire population. If the goals of an assessment
program are to gather statewide information only,
it is hard to conceive of any reason for testing all
students in a given grade. For example, if there are
50,000 third-graders in the state of Limbo, and one
wants to gather state statistics only, it is very
possible that a sample of 5,000 students (or even 500)
would be sufficient if they are selected by a
probability sample. Or, if one can afford to
test all 50,000 third-graders, and if it is deemed
wise to do so, one could select ten 5,000-pupil
samples and secure information on ten subject
areas, or one could go into great depth of information
gathering in two or three subject areas. The
combinations of possibilities of sampling pupils
and content are almost endless.
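
The arithmetic behind the Limbo example can be sketched as follows. This is an illustration only, resting on assumptions that the monograph does not state: a simple random sample, a single yes/no measure (for example, the proportion of pupils answering one exercise correctly), the worst-case proportion of 0.5, and a 95 percent confidence level. The function names are hypothetical, and an actual design would need the advice of a sampling statistician.

```python
import math
import random


def margin_of_error(population, sample, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a
    simple random sample drawn without replacement (assumed design)."""
    standard_error = math.sqrt(proportion * (1 - proportion) / sample)
    # Finite population correction: sampling a large share of a
    # finite population reduces the error slightly.
    correction = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * correction


def split_into_samples(pupil_ids, parts, seed=0):
    """Randomly divide all pupils into `parts` non-overlapping samples,
    e.g. one sample per subject area."""
    ids = list(pupil_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::parts] for i in range(parts)]


POPULATION = 50_000  # third-graders in the hypothetical state of Limbo

for n in (5_000, 500):
    print(n, round(margin_of_error(POPULATION, n), 3))

# Ten 5,000-pupil samples, one for each of ten subject areas.
groups = split_into_samples(range(POPULATION), parts=10)
print(len(groups), len(groups[0]))
```

Under these assumptions the margin of error is roughly 1.3 percentage points for 5,000 pupils and roughly 4.4 points for 500, which is the sense in which either sample might be sufficient for statewide statistics.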

If one wants district-level information, then
sampling becomes a different situation. In a school
district with one third grade, sampling of pupils is
hardly possible for most assessment purposes. In
school districts with many third-graders, sampling
could provide a greater variety of information than
common testing on every pupil, in the same
fashion as at the state level. Specific decisions of
how far to carry sampling should be made only
after advice from a sampling statistician. Sampling
is a highly developed technical field, and the
implications of any decisions to sample or not to
sample must be reviewed by competent samplers.

Other compromise possibilities exist. One
could test all students in a population with one
short test, while using a sampling approach for
other tests. This approach would provide some
common information on all students but would
allow for greater depth of data collection over a
subject area.

Principle: Sampling of pupils and/or content should 
be given very serious consideration for all large-scale 
assessment projects. The only situation where it may 
not be useful is one where it is deemed essential to 
collect common information on all students in a 
statewide population of students. Sampling should be 
used to maximize the collection of usable information
for stated assessment purposes at the lowest
possible cost and effort.

* * * *

[Sampling with total tests is less complicated to
administer, but since it is likely to be subject to error
in administration and consequently less reliable, in
some cases item sampling may be more useful. Therefore,
Dr. Womer was asked to prepare an additional
statement on the purposes and potential of item
sampling. His statement follows.]

Item Sampling 

The process of item sampling in testing is
most useful for one of two purposes:

1. To increase the amount of group test
results that can be obtained from students
in a given period of time, or



2. To decrease the amount of testing time 
necessary to obtain large amounts of 
group test information from students. 

For either purpose, it is essential to keep in
mind that item sampling is useful for gathering
information about groups of students. Thus it is a 
technique for use with relatively large groups, not a 
classroom-sized group or even three or four classes 
within a building. 

Example 1 

A school system has 500 students in the sixth
grade. A standardized reading test is to be
administered for a one-shot systemwide survey.
The test takes 45 minutes to administer, which
is all the time that can be taken from a busy
schedule at the end of the year.

Staff are unhappy that only reading is to be
surveyed. Some major changes were made in the
mathematics curriculum three years before and
they feel it would be valuable to survey this
subject also. By randomly selecting only 250 of
the students to take the reading test, the other
250 could be given a 45-minute mathematics
test at the same time.



Example 2

A school system has 1,000 fourth-graders. It is
desired to do an in-depth study of student outcomes
for 100 different behavioral objectives in
mathematics. Each objective requires the use of
eight questions. The total of 800 questions
would require one student to spend perhaps 15
hours of testing time to attempt all of them.

By randomly dividing up the objectives and
items into five different subtests (each with 20
objectives and 160 items), each subtest could be
administered to 200 students (randomly
selected). This would require only 3 hours of
testing time per student (manageable) rather
than 15 hours (unmanageable), and group results
would still be available for all 100 objectives
(800 items).

In either example the results will be usable for
group analyses. Any slight reduction in accuracy
due to sampling error is apt to be much less than
errors due to increasing testing time of students
beyond some reasonable amount. Systematic errors
due to fatigue, disinterest, poor motivation, teach- 
er concern, and other conditions of testing can 
easily outweigh a small sampling error. 
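
The arithmetic of Example 2 can be laid out as a short sketch. It is an illustration under stated assumptions, not a prescribed procedure: the function name, the fixed random seed, and the round-robin split after shuffling are all choices made here for clarity.

```python
import random


def build_item_sampling_plan(n_objectives=100, items_per_objective=8,
                             n_subtests=5, n_students=1000, seed=0):
    """Randomly divide objectives and students among subtests.

    Every student takes exactly one subtest; every objective
    appears on exactly one subtest.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    objectives = list(range(n_objectives))
    students = list(range(n_students))
    rng.shuffle(objectives)
    rng.shuffle(students)
    plan = []
    for k in range(n_subtests):
        subtest_objectives = sorted(objectives[k::n_subtests])
        plan.append({
            "objectives": subtest_objectives,
            "n_items": len(subtest_objectives) * items_per_objective,
            "students": sorted(students[k::n_subtests]),
        })
    return plan


plan = build_item_sampling_plan()
for number, subtest in enumerate(plan, start=1):
    print(f"subtest {number}: {len(subtest['objectives'])} objectives, "
          f"{subtest['n_items']} items, {len(subtest['students'])} students")
```

With the figures from Example 2, each of the five subtests carries 20 objectives and 160 items and goes to 200 students, so every objective is still answered by 200 students while each student sees only one fifth of the 800 items.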







A TEACHER VIEWS CRITERION-REFERENCED TESTS 

by Jean S. Blackford



Recently, students and teachers have been
questioning the use of standardized tests, including
their administration, ranking, scoring, and reporting
procedures. We teachers deplore the use of tests
to rank students and recognize that use of standardized
tests does not, in fact, improve educational
programs.

In response to these negative reactions, test
developers have offered the criterion-referenced
test (CRT), which tells something about an individual
student without reference to the performance
of any other student. Despite this
advantage of CRT over norm-referenced tests, we
teachers must still consider several things as we
become part of the national movement toward
criterion-referenced tests. We must also seek
relevant and constructive action programs to learn
about alternative assessment tools.

Given the CRT methodology, teachers must
have assurance that the test items are directly
based on the instructional objectives that are
included in their students' curriculum. Neither
students nor we teachers should be assessed on test
items that reflect educational outcomes not
included in the local instructional program.

It is important that objectives be developed at
the local level and that they take into consideration
the fact that all learning cannot be translated
into CRT items. Evidence is fast accumulating that
cognitive processes are measurable but that
higher-level thought processes are very difficult to
measure. Thus we find ourselves measuring very
simple tasks. As teachers, we must be aware that
the test questions directly related to instruction are
limited to assessing mastery of specifics and do not
assess a student's general ability.

Indeed, if we are permitted to formulate the
instructional objectives for our own students, we
must use caution in selecting the objectives that
meet the personal needs of the particular children
we teach. Selection of improper objectives can lead
to highly detrimental consequences. Narrowly
structured objectives may be readily mastered by
students, but they are grossly unfair to students
and to the teacher. Those students who are taught
only through a set of rigidly applied performance
objectives are being denied the broad experience of
varied learning styles and creative teaching
techniques.

Consider, for example, what might happen if
students were required to master in a 10-week
period the following seven objectives taken from
an elementary item bank in the skill area of
language arts comprehension:

1. Retell a story
2. Use a given word in a written sentence
3. Name the class of a group of pictures
4. Identify synonyms
5. Identify antonyms
6. Select a picture to match a sentence
7. Recall facts for who, what, and where
   questions.



If students miss the established mastery level,
the teacher faces this dilemma: Have they missed
by only a little? On the other hand, if students
master the objectives in an outstanding manner,
the teacher may not know whether their performance
indicates a high likelihood of success on
subsequent objectives or whether the CRT items
were written in such a way that competency was
assured.

Two other problems that plague teachers are
related to the way criterion-referenced tests are
developed and how their results are used.

One is that test items are developed in a
hierarchy of difficulty which almost assures that
certain percentages of students will not be able to
respond correctly to some items. This flies in the
face of the very philosophy of instructional
objectives: that all students should be helped to achieve
all objectives as fully as possible and that a major
purpose of testing ought to be to determine which
students need more work on which objectives in
order that they may achieve full mastery.

The second, which is closely related to the
first, is the use of cutting scores, pass-fail points, or
minimal competency levels. Reporting and decision
making based on such measures can result in the
use of criterion-referenced tests for the sorting and
classifying of students, a practice that has been
found so objectionable with means, quartiles, and
similar statistics in norm-referenced tests.

Finally, if we are to pursue CRT as an aid in
meeting the instructional needs of our students, we
must insist upon proper in-service education in the
preparation of test items that are attributable to
our instruction. The items must be developed in
such a way that each item will require a specific
response. We must be thoroughly instructed in
sound and fair development of CRT items.






GUIDELINES AND CAUTIONS FOR CONSIDERING CRITERION-REFERENCED TESTING 

by Bernard McKenna



Standardized achievement tests used in most
schools today are known as norm-referenced tests.
They are constructed in such a way as to maximize
differences among students so that one can be
compared to another. This is done by providing for
maximum discrimination between high and low
scores. The purpose is to rank a student among his
or her peers. Hence, scores are reported in such
terms as "Chris Jones is in the ninety-fifth
percentile on verbal reasoning." While
norm-referenced tests are useful for sorting people into
categories (to the dismay of many), they are not
useful for improving educational programs.

Recently a new concept has been promoted
among test makers and the educational public
called "criterion-referenced testing," also termed
"objective-referenced testing." At least three
factors have contributed to the emergence of this
new concept: First, there is a strong and rising
dissatisfaction with tests in general; second, there is
the inadequacy of traditional tests for diagnostic
and instructional purposes; and third, there is some
clamor for evaluating instruction and teachers as
part of the accountability movement. Although
criterion- or objective-referenced tests may have
potential for diagnosing learning problems and
improving instruction, they are not useful for
evaluating teachers. For test scores depend largely
on variables in a student's background rather than
on what he or she is taught in the classroom. Even
so, a few years ago a bill was introduced in the
Kansas legislature to cut off funds to districts
whose children did not score above the national
average on such tests. Fortunately the bill did not
pass.

Criterion-referenced tests, instead of comparing
one child to another, presumably measure
the child's performance against a specified criterion
or objective. Thus all children might be able to
achieve the criterion and eventually score 100
percent on the tests. The criterion-referenced test,
in concept, is much like the kind of test the
teacher gives in the classroom on Friday to
evaluate learning of specific objectives taught
earlier in the week.

Conceivably the external criterion toward
which the test is directed could be a number of
things. For example, one could have a
criterion-referenced test for measuring the skills of a
bricklayer without reference to how others do. For
example: Can he or she lay bricks? Mix mortar?
The higher an individual scored on the test, the
closer that individual would be to acquiring a
bricklayer's skills, regardless of how many other
people had the same skills.

Test makers, however, have shown little
inclination to develop tests directed toward such
criteria. Establishing a sequence of skills and
validating them is a laborious, difficult, multiyear
task at best. Staying with the example of the
bricklayer, they would have to conduct studies to show
that good bricklayers score high on the test; that is,
they would have to evaluate the test. Test makers
instead have resorted to a conception of
criterion-referenced tests as those which yield measurements
"directly interpretable in terms of specified
performance standards." In practice, this means that
the criterion toward which the test is directed is
usually a prespecified objective, or objective stated
in advance, e.g., "A bricklayer must be able to mix
mortar."

Thus criterion-referenced usually means in
practice objective-referenced. In fact, those who
have most strongly propagated criterion-referenced
testing are frequently the same persons who have
propagated behavioral objectives. In typical
procedure, objectives are established and test items
are written to measure those objectives. Test
results can be reported in terms of what specific
objectives each individual student was able to
achieve, which presumably is useful for instructional
purposes. In this way, it is argued, tests can
be tailored to specific objectives the way a teacher
tailors test questions on what he or she has taught.

The distinction between criterion-referenced
and norm-referenced tests is quite blurred. Most
test makers use similar procedures to construct
items for both types, or use the same item, and
employ test statistics for norm-referenced items in
selecting items for criterion-referenced tests. There
are no clearly defined and commonly agreed upon
procedures for constructing criterion-referenced
tests, and many of them are in fact norm-referenced
tests in disguise. The distinction becomes a
matter of emphasis rather than being clear-cut.

Frank B. Womer defines a criterion-referenced
test as

...one which is designed to provide information
about attainment of a specific objective (criterion),
which emphasizes direct measurement through the
use of differing formats, which may use items at
varying difficulty levels, which must have content
validity, which must minimize guessing, and which is
particularly useful for instructional and evaluative
purposes.

Womer's "differing formats" term indicates
he is keen on test items which call for responses
other than multiple-choice. Many criterion-referenced
tests continue to be made up mainly of
multiple-choice items.

A main advantage claimed for criterion-referenced
tests is their utility for improving educational
programs. In view of the confusion among
test makers themselves about the concept,
construction, and utility of the tests, some caveats are
in order for those considering the use of
criterion-referenced or objective-referenced tests:

1. Common deficiencies in testing need to be
communicated both to the profession and to
the public. Neither criterion-referenced tests
(CRT's) nor objective-referenced tests
(ORT's) eliminate the most common
deficiencies of tests in general.

CRT's and ORT's for the most part still
measure simple tasks at the expense of relearning
abilities and higher-level thought processes.
Complex performances are so difficult to measure
that test items reflect only the simpler tasks. Such
things as Binet's categories of mental imagery,
imagination, aesthetic appreciation, and moral
sensibility are almost totally unmeasured.



2. Teachers should examine carefully the derivation
of the objectives for ORT's.

ORT's can be no better than the objectives on
which they are based. Unfortunately, the methods
for deriving objectives are often ill-considered,
hasty, and grossly inadequate. There is an inclination
among test makers to slide over the problems
of deriving objectives in order to get to item
construction, a task with which they are more familiar.
Yet appropriate objectives are just as important
and just as difficult to arrive at as are test items.

There are at least four ways to choose objectives.
First, choosing by expert judgment means
that a small group of subject matter experts
decides which objectives should be measured for a
given field. This was essentially the origin of
National Assessment tests. While few persons
would deny the relevance of the judgments of
subject matter experts, few would contend that
such judgments faithfully or completely represent
what should be taught. By no means do they fully
represent the judgments of teachers, parents,
students, and others vitally concerned.

A second way of choosing objectives is by
consensus judgment, which requires that various
groups (teachers, administrators, parents, school
board, etc.) decide what objectives are most
important. (For the purposes of this chapter,
"objectives" refers to specific student learning
outcomes.) Unfortunately, the immense problems of
such prioritizing have been slighted. Frequently
decision-making groups respond only to those
objectives that are presented to them by a single
group (e.g., school administrators) or a limited
number of groups. Correcting important objectives
that have been omitted is not taken into account.
If critical objectives do not emerge from the
objective-generating process they are ordinarily lost
forever. For example, there is likely to emerge a high
preponderance of content-bound objectives that
are easily measurable. More subtle learnings are
neglected. Attending to the objectives that are
easily identifiable severely limits the range of
decision-makers' thinking and results in determining
and limiting the curriculum.

The rating of priority statements themselves is
severely dependent upon how abstractly the objectives
are specified (i.e., how global they are), the
types of criteria on which the objectives are rated
(i.e., rated in importance, how much money will
be spent on them, how much time and effort will
be spent), and the nature of the groups doing the
rating. Test makers have had little experience
polling the opinions of nonprofessional groups, so
surveys for the purpose of developing or rating the
importance of objectives are likely to be highly
class-biased. Actually, such surveys are seldom
done. Objectives generation and measurement are
likely to be treated in the most cavalier fashion.
Test developers who would never think of
including an item without field testing it sometimes
accept and discard objectives with abandon. A
common procedure is to have the objectives
reviewed by a small group of citizens and educators
and claim that the objectives have been approved
by the public. Those citizens involved are too
frequently upper middle-class and the educators
are selected in such a way that they are not
broadly representative.

A third way of deriving objectives is through
curriculum analysis. One can inspect materials such
as textbooks or courses of study to determine what
is being taught and then write objectives and test
items based on such content. Much of the impetus
for CRT's came from curriculum developers like
those who pioneered Individually Prescribed
Instruction (IPI) as part of their efforts to develop
tests that measure exactly what the materials
teach. This procedure also has its limitations in
that it is likely to emphasize only content-related
objectives.

Fourth, objectives can be chosen by in-depth
analysis of those instructional areas which one
wishes to test. One tries to determine the contents
and behaviors in an area of instruction and to
associate objectives and test items with contents
and behaviors. In other words, by task analysis the
instruction is broken into discrete learnings. The
most ambitious efforts along this line have
resulted in instruments called "domain-referenced
tests."

Domain-referenced testing (DRT) attempts to
define domains of behavior (categories of behavior
one might test and teach for) and to represent
these domains by an extensive pool of test items
which measure human performance in a particular
domain or domains. In one sense, domain-referenced
tests appear to be an attempt to escape the
triviality and absurdity of much of the behavioral
objectives movement. If one must delineate a
highly specific objective for each aspect of student
behavior, one might generate thousands of such
objectives. In one project an attempt to define a
complete set of objectives for the high school was
given up after 20,000 objectives had been written.
A complete delineation becomes an absurdity and
most such lists become trivial.

Domain-referenced testing aims at overcoming
these problems by defining important categories of
content and behavior so that only objectives
representing particular domains become important.
Other objectives are merely subsets or examples.
The instructional benefits of such a scheme
promise to be large since one could practice on
other objectives and test items from the domain to
learn the behavior. One could always construct
another test from the innumerable objectives and
test items representing that domain.

DRT's exist more in promise than in practice.
No doubt the task analysts will confront the same
formidable conceptual problems as have
psychologists who try to categorize mental behavior and
curriculum developers who try to define the
structure of their subject. Even the most sophisticated
schemes of human mental abilities, such as Bloom's
Taxonomy, tend to falter when subjected to
empirical examination. Human mental processes
defy categorization, which suggests emphasis on the
long-debated principle of teaching to the whole
child rather than to specific skills.

3. Teachers should have an extensive role, from 
the beginning, in deriving objectives and 
should beware of co-optation. 

Most teacher and public involvement in
developing objectives has been cursory at
best, more for the purpose of legitimizing the
objectives than for determining or implementing
them. For example, objective-referenced tests were
developed for the state assessment program in
Michigan and employed on a mandatory basis at
selected grade levels. For the selected grades,
subject specialists from the state education agency
set up a small committee that developed goals which
were later reviewed by subject-matter associations.
Then several one-day large group meetings were
held around the state to give people a chance to
respond.

Despite this effort to involve them, many of
the teachers and administrators who participated in
the group meetings felt that they had not had
adequate input on the objectives. They were
presented with a list of objectives and asked to
respond after a cursory review. Most teachers in
the state never saw or heard of the objectives. In
spite of promises that the objectives were only for
experimental purposes, the state agency developed
tests based on them and administered them the
following year, claiming educator endorsement.

4. Which objectives are selected and retained for 
testing is critical for ORTs. Teachers should 
be intimately involved from the beginning in 
selecting objectives. 

Selection of final objectives for testing is as
important as generating them, and teachers are
frequently provided only cursory participation in
this activity also. In the Michigan assessment
program over four hundred objectives were generated
for fourth-grade mathematics, yet only thirty-five
were selected for testing. The limiting factor was
the amount of time required for testing each
objective, for it was deemed advisable not to
exceed five hours of testing time. Which objectives
were excluded? Why? If only the most important
objectives were included, how was their importance
determined? What would be the instructional
effect over time of excluding the other several
hundred objectives? In most cases of objective
development, the objectives are rewritten and
screened by state education agency officials, select
citizens' groups, and test makers. For example, in
Illinois goals derived from public hearings were
selected and extensively rewritten by several
groups before being presented as public goals.

5. The ways in which test items are constructed
should be examined. When possible, teachers
should employ their own test experts to help
them assess the procedures.

The usual number of items to measure one
objective seems to vary from three to five. Good
results have been obtained with five. Since even the
most specific objective can be measured by
thousands of test items, selection is important.
Sophisticated test makers use a systematic
sampling plan that produces items for
subcategories of the objectives.

Of at least equal importance is the type of
response the item calls for. Traditional tests use
multiple-choice answers because they are easy to
score by machine. However, if the purpose of the
test is to describe and diagnose classroom learning
and provide usable information to the teacher,
multiple-choice answers may be much less desirable.
The degree to which a test is a faithful sample
of learning behavior is more important in an
objective-referenced test than in one which merely
strives to differentiate among students.

A group of items constructed by teachers is
likely to be more relevant to the instruction of
those particular teachers. Items written by
measurement experts from a matrix of content and
behavior are likely to be technically better but less
relevant.

6. CRT's and ORT's should be thoroughly field
tested. Teachers should refuse to use tests
that have not been thoroughly field tested.

While this may seem a rather obvious caveat,
the fact is that many objective-referenced tests
have not been extensively tried out. Even where
tried out, frequently only a handful of students are
involved. Tests with so little field testing should be
resolutely avoided. The test developer should be
required to present details of the field test; the test
developer who can't probably hasn't conducted
one, which is an all too common occurrence.

7. Test developers should present evidence of
the test's reliability. Teachers should not use
tests for which evidence on reliability is
unavailable.

For an ORT, each set of items used to
measure an objective might be considered a test in
itself. These should be reliable measures in and of
themselves. The usual reliability determinants are test
statistics which are measures of internal consistency
developed for traditional norm-referenced
tests. They are based on variations in individual
test scores (item difficulty and the difference
between the top scorers as opposed to bottom
scorers, for example). The reliability will be highest
when about half of the students get an item right
and half get it wrong, a norm-referenced concept
maximizing discrimination among test takers.
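
The dependence of internal-consistency statistics on score variation can be seen in a small sketch. The text does not name a particular statistic; Kuder-Richardson formula 20 (KR-20) is used here as an assumption because it is a common internal-consistency measure for right/wrong items. The data sets are invented for illustration.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by a proportion p."""
    return p * (1 - p)


def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores.

    Returns None when every student has the same total score, because
    the statistic is undefined without variation in totals.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(student) for student in responses]
    mean_total = sum(totals) / n_students
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n_students
    if total_variance == 0:
        return None
    sum_item_variance = 0.0
    for j in range(n_items):
        p = sum(student[j] for student in responses) / n_students
        sum_item_variance += item_variance(p)
    return (n_items / (n_items - 1)) * (1 - sum_item_variance / total_variance)


# Item variance peaks when half the students answer correctly.
for p in (0.1, 0.5, 0.9, 1.0):
    print(f"p = {p}: item variance = {item_variance(p):.2f}")

# Invented data: half the group answers all five items right, half none.
split_group = [[1] * 5] * 10 + [[0] * 5] * 10
# Invented data: every student has mastered the objective.
full_mastery = [[1] * 5] * 20

print("split group:", kr20(split_group))
print("full mastery:", kr20(full_mastery))
```

When every student masters the objective, which is the stated goal of objective-referenced instruction, there is no score variation left and a statistic of this kind cannot be computed at all. That is one way of seeing why borrowing norm-referenced reliability techniques sits uneasily with ORT's.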

Using these traditional techniques causes the
tests to discriminate in the same way as do items in
standardized tests. Unfortunately, the ORT
developers have not been able to solve this problem.
The alternative is to have no evidence of
reliability, which to many is even more
unacceptable. Perhaps the best policy is to insist on
some measures of reliability, ones for which the
test developers supply a public rationale which can
be assessed.

8. The test makers should present evidence of
the validity of the tests. Teachers should
inspect the validation procedures carefully.

Validity, which depends upon the ability to
answer the question, "Does the test measure what
it is supposed to?", presents another difficult
problem for the maker of criterion-referenced
tests. For traditional norm-referenced tests,
validity is often established by how well the test
predicts concurrent academic grades. But this
makes little sense for CRT's. Test developers are
usually left trying to make logical assessments of
content validity based on how the tests were
developed.

If the test is objective-referenced, one can
assess whether test items adequately measure the
objectives and whether the objectives themselves
are valid for what the test is trying to measure.

If the test purports to measure the effects of
classroom instruction, then the objectives must be
the ones taught and the test items must be sensitive
to instruction. The Michigan assessment program
tried a sensitivity index to determine if correctly
responding to an item was dependent on instruction.
The index didn't work in this situation. A
highly specific objective might be valid for one
class but not for another, and a test which
presumes to be valid for assessing instruction in a
whole state has the problem of demonstrating that
its items and objectives were constructed in such a
way as to be appropriate statewide, not an easy
task. The whole problem of validity is an
unresolved one, but the burden of proof should fall
on the test maker, not the buyer.

No matter what the derivation of the test or
what it is called, unless it covers what a particular
teacher has taught it cannot be a valid measure for
that teaching situation; it is a measure of someone
else's objectives. On the other hand, if the test is a
measure of objectives which the teacher developed
but which he or she is willing to accept as indicative
of his or her instruction, then the objectives are
valid for that teaching situation.

9. "Minimal competency" or "mastery" cut-off
points for students should be viewed with
some suspicion. Teachers should question
arbitrary standards and substitute their own.

Item difficulty on tests can be manipulated
easily by test makers. Whether a student scores 30
percent or 88 percent can be built into the test
itself and just as easily changed by assigning
arbitrary values to test items. Since there is no
objective means by which tests can establish a level
of satisfactory competency, the setting of such
standards is extremely arbitrary. What is minimal
competency in reading? When has one mastered
reading? On the other hand, one may be willing to
accept the opinions of certain groups as standards
if they are clearly recognized as group opinion and
subject to all the deficiencies that implies.

Nonetheless, many CRT developers continue
to build highly arbitrary standards into their tests.
For example, the Michigan assessment is based on a
minimal skill concept that declares a student must
achieve 75 percent of the minimal objectives. In
the first year of implementation some of the
districts where the highest academic achievement
might be expected were able to achieve only 30
percent of some objectives. The 75 percent cut-off
was evidently without justification.
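
How much rides on the choice of cut-off can be shown with a small sketch. It is purely illustrative: the function name and the student's count of attained objectives are hypothetical, and the figure of 35 objectives is borrowed from the Michigan fourth-grade mathematics example mentioned earlier only to give a realistic scale.

```python
def meets_cutoff(objectives_attained, objectives_tested, cutoff):
    """True if the share of objectives attained reaches the cut-off."""
    return objectives_attained / objectives_tested >= cutoff


OBJECTIVES_TESTED = 35   # scale borrowed from the fourth-grade example
attained = 26            # hypothetical student: 26 of 35, about 74 percent

for cutoff in (0.70, 0.75, 0.80):
    verdict = "meets" if meets_cutoff(attained, OBJECTIVES_TESTED, cutoff) else "misses"
    print(f"cut-off {cutoff:.0%}: student {verdict} the standard")
```

The same performance counts as competent under a 70 percent rule and as a failure under a 75 percent rule, and nothing in the test itself says which rule is right.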

10. Many objective-referenced tests are really
norm-referenced tests in disguise. No teacher
should voluntarily administer a test that he or
she does not understand.

If one constructs objectives such as "reading a
newspaper at a fourth-grade level," the norm is
obviously built in. If one then selects test items
using traditional test statistics, like item difficulty,
and uses items from norm-referenced tests, the
result is a test that discriminates among students
but has the appearance of being referenced to skills
rather than students. It becomes a norm-referenced
test that looks like a criterion-referenced test.
(Some test experts claim that it is impossible to
construct anything other than a norm-referenced
test.) It is also possible to use ORT results in a
norm-referenced manner if one counts how many
objectives each student learned and then makes
comparisons among students.
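
The last point can be sketched in a few lines. The student names, objective names, and both function names below are hypothetical, chosen only to illustrate the contrast: the same mastery records support an objective-by-objective report or, once collapsed to a count, a ranking of students.

```python
# Hypothetical mastery records: objective -> True (mastered) / False.
records = {
    "Ana":   {"main idea": True, "sequence": True,  "inference": False},
    "Ben":   {"main idea": True, "sequence": False, "inference": False},
    "Carla": {"main idea": True, "sequence": True,  "inference": True},
}

def objectives_report(record):
    """Objective-referenced use: list the objectives not yet mastered."""
    return [name for name, mastered in record.items() if not mastered]

def rank_by_count(all_records):
    """Norm-referenced use: collapse each record to a count and rank students."""
    counts = {student: sum(r.values()) for student, r in all_records.items()}
    return sorted(counts, key=counts.get, reverse=True)

print(objectives_report(records["Ben"]))  # what Ben still needs to learn
print(rank_by_count(records))             # students ordered against each other
```

The first function tells a teacher what to teach next; the second discards that detail and reproduces the student-against-student comparison the text warns about.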

11. The public and the profession should be made
aware that CR or ORT's are not panaceas.
Test bias problems remain the same with CR
or ORT's as with norm-referenced tests.

Lower socio-economic groups will score as
poorly on criterion or objective-referenced tests as
they do on norm-referenced tests. Basic factors
such as malnutrition and lack of motivation toward
school and test taking are untouched by change
from one type to another. What CRT's might offer
some students is a reprieve from being told they
are inferior. (In some districts test scores are
attached to the report cards or even reported in the
newspapers.) Since self-confidence seems to be
critical in schooling, lack of stigmatization could
be an important advantage. Another advantage
might be to spell out in greater detail where certain
educational weaknesses of students lie. Actually,
CRT developers have done little that might result
in preventing racial, social class, school-building, or
neighborhood bias in their tests.

12. CRT's could cost more than traditional tests,
depending on the thoroughness of develop-
ment. The costs of tests versus their utility
should be carefully considered.

Traditional norm-referenced tests already 
exist and do not need to be developed, so if CRT 
superiority can't be positively demonstrated, the 
question should be raised, "Why go to the extra 
time and expense?" Also, because of their greater
specificity, consider that CRT's might be valid for
only a small domain of behavior at a given point in
time (there could be large rewards in this, of
course, in promoting learning). Many more tests
would have to be developed rather than a few gen-
eral ones. Developing and validating objectives
and test items is a long, difficult, and costly
procedure when properly done.

There are two ways of reducing costs. One is
based on the assumption that there are certain
basic and necessary skills and stages of learning
independent of the local setting and that one need
develop only one test for basic reading skills and
sell it to everyone. This is the assumption of the
test makers, but it is a questionable one. Learning
often seems to be highly context-dependent. Chil-
dren learn in different ways in different settings.
The inability of educational research to come up
with guaranteed teaching techniques and the in-
ability of psychology to demonstrate transfer of
training indicate this is so.

Another way of reducing costs would be to
have local groups of teachers develop their own
CRT's as they now do for their classrooms. But
there is the question of whether the amount of
time required would be profitably spent in test
construction.

13. Teachers should not be evaluated on CRT's
and ORT's any more than on norm-referenced
tests. Teachers should not allow themselves to
be evaluated on the basis of ANY tests.

Tests are not good measures of what is taught 
in school. Although objective-referenced tests pur-
port to be better measures of learning, they cannot
be considered good measures of teaching. An
obvious deficiency is that the tests measure only
cognitive aspects of the classroom. In addition, the
teacher does not have control over many of the 
variables that affect test scores. Evaluating teachers 
is a use that should not be claimed for ORT's. The 
evaluation of teaching should be based on observa- 
tion, self-evaluation, student ratings, interviews, 
and many other types of data. 

14. A main advantage of CRT's or ORT's seems
to be in the reporting of results, that is, avoid- 
ing blanket categorizations of children by test 
scores and providing more useful instructional 
information. Subtests should be used only as 
diagnostic instruments. 

Instead of a composite score with which the 
teacher can do little but type the child, in criterion 
or objective-referenced testing the teacher is pre- 
sented with specific objectives the student can or 
cannot accomplish. The avoidance of a single score 
categorizing the child is a major benefit. Pre-
sumably the teacher also will be better able to
make use of the detailed objectives for improving 
instruction and learning. 

It should be noted, however, that there is 
little evidence that a teacher can do a better job 
working with specific objectives than working 
without them. Whether to use specific objectives
should remain a matter of style and judgment for
the individual teacher. Robert E. Stake has
indicated that there are significant costs in using
behavioral objectives, including the possibility that
the teacher will teach only what is easy to
measure. In Michigan, most teachers did not find
the ORT's valuable for instructional purposes. The
instructional benefits are also reduced by the
limited number of objectives to which one can
teach and for which one can reasonably test.

15. While worthy of consideration, the claims of
criterion-, objectives-, and domain-referenced
tests should be viewed with some skepticism
but with an open mind. Teachers should
vigorously resist the misuse of all kinds of
tests.

In some ways CRT's can be viewed as a 
response by the testing establishment to avoid 
some of the criticisms of tests. Such was the 
motivation in Michigan. CRT's and ORT's still 
embody most of the deficiencies of tests in general
and are not useful for evaluating teachers in ac-
countability schemes. The tests are also difficult to
construct and are subject to much conceptual con-
fusion, even though they do offer the potential of
being more useful for instruction. 

An important benefit of CR versus norm- 
referenced tests is that with CRT's the test taker is 
not stigmatized by a global score supposedly repre-
senting his or her ability. This is a great advantage.
The best use of tests is in raising questions in the
teacher's mind about individual students who
achieve unusual scores. The tests themselves may
be in error, or the teacher's preconception may be.
In any case, following up on seeming discrepancies
is the job of the professional. Tests should be used
to raise questions, not to resolve them.



GLOSSARY OF MEASUREMENT TERMS*



Achievement Test 

A test that measures the amount learned by a stu- 
dent, usually in academic subject matter or basic 
skills. 






Aptitude Test

A test consisting of items selected and standardized
so that the test yields a score that can be used in
predicting a person's future performance on tasks
not evidently similar to those in the test. Aptitude
tests may or may not differ in content from
achievement tests, but they do differ in purpose.
Aptitude tests consist of items that predict future
learning or performance; achievement tests consist
of items that sample the adequacy of past learning.

Criterion

A standard of judgment used as a basis for 
quantitative and qualitative comparison; that 
variable to which a test is compared to constitute a 
measure of the test's validity. For example, grade-
point average and attainment of curricular objectives
are often used as criteria for judging the validity of 
an academic aptitude test. 

Criterion-Referenced Test 

A test in which every item is directly identified 
with an explicitly stated educational behavioral 
objective. The test is designed to determine which 
of these objectives have been mastered by the 
examinee. 

Grade Norm 

The average test score obtained by students classi- 
fied at a given grade placement. 

Local Norms 

Norms that have been obtained from data collected
in a limited locale, such as a school system, county,
or state. They may be used instead of national
norms to evaluate student performance.

*Excerpts from the revised edition of A Glossary
of Measurement Terms: A Basic Vocabulary for
Evaluation and Testing, published by CTB/
McGraw-Hill, Del Monte Research Park, Monterey,
California 93940. Copyright © 1973. Reprinted by
permission of the publisher.

Multiple-Choice Item 

A test question consisting of a stem in the form of
a direct question or incomplete statement and two
or more answers, called alternatives or response
choices. The examinee's task is to choose from
among the alternatives provided the best answer to
the question posed in the stem.

Nonverbal Test 

A test in which the items consist of symbols, 
figures, numbers, or pictures, but not words.

Performance Test 

A test that requires the use and manipulation of
physical objects and the application of physical
and manual skills. Shorthand or typing tests, in
which the response called for is similar to the
behavior about which information is desired,
exemplify work-sample tests, which are a type of 
performance test. 

Random Sample 

A sample drawn in such a way that every member 
of the population has an equal chance of being 
included, thus eliminating selection bias. A random 
sample is "representative" of its total population.

Reliability

The consistency of test scores obtained by the 
same individuals on different occasions or with
different sets of equivalent items; accuracy of
scores. Several types of reliability coefficients
should be distinguished. 

Coefficient of internal consistency is a measure
based on internal analysis of data obtained on a
single trial of a test (Kuder-Richardson formulas
and the split-half method using the Spearman-
Brown formula).

Coefficient of equivalence or alternate forms
reliability refers to a correlation between scores
from two forms of a test given at approximately
the same time.

Coefficient of stability or test-retest reliability 
refers to a correlation between test and retest with 
some period of time intervening. The test-retest 
situation may be with two forms of the same test. 
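
The glossary names two internal-consistency formulas without stating them. The sketch below uses their standard textbook forms: the Spearman-Brown step-up for a split-half correlation, and Kuder-Richardson formula 20 for items scored right or wrong. The function names and the tiny response matrix are hypothetical, and the use of the population variance of total scores is an assumption (some treatments use the sample variance).

```python
from statistics import pvariance

def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

def kr20(item_matrix):
    """Kuder-Richardson formula 20.

    item_matrix: one row per examinee, one column per item, scored 0 or 1.
    Assumes at least two items and some spread in total scores.
    """
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    var_total = pvariance(totals)  # population variance (an assumption)
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion passing item j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Invented responses: four examinees, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(spearman_brown(0.6))
print(kr20(responses))
```

Both functions need only a single administration of the test, which is what distinguishes internal consistency from the equivalence and stability coefficients defined above.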

Standardized Test 

A test constructed of items that are appropriate in 
difficulty and discriminating power for the in- 
tended examinees and that fit the preplanned table 
of content specification. The test is administered in 
accordance with explicit directions for uniform 
administration and is used with a manual that con-
tains reliable norms for the defined reference
groups. 

Validity 

The ability of a test to measure what it purports to 
measure. Many methods are used to establish 
validity, depending on the test's purpose. 






THE TESTING OF MINORITY CHILDREN: A NEO-PIAGETIAN APPROACH

by Edward A. De Avila
Barbara Havassy



The National Education Association, the
popular press, the courts, civil rights organizations,
state and federal agencies, and others have pointed
to the failure of the test-publishing industry to
consider fully the cultural and linguistic differences
of minority children when constructing psycho-
logical tests. Test publishers have responded by
translating existing intelligence and nationally
normed achievement tests into other languages
such as Spanish, adjusting norms for ethnic sub-
groups, and attempting to construct culture-free
tests. Each approach involves distinct problems.
Moreover, in our opinion, the tests as they are cur-
rently designed are of little use to anybody.

Translating existing intelligence or achieve-
ment tests for non-English-speaking children often
creates problems. First, regional differences within
a language make it difficult to use a single transla-
tion in a standardized testing situation where
examiner and examinee are permitted virtually no
interaction. Thus, while toston means a quarter or
a half dollar to a Chicano, it means a portion of
banana squashed and fried to a Puerto Rican.

Second, monolingual translations are inappro-
priate because the language familiar to non-
English-speaking children is often a combination of
two languages as in the case of Tex-Mex. Third,
many non-English-speaking children have never
learned to read in their spoken language. For
example, many Chicano children speak Spanish but 
have had no instruction in reading Spanish. 

Another major response of the testing
industry to criticism has been to establish or to
propose establishment of regional and ethnic
norms. Such a practice leads to lower expectations
for minorities, which in turn may lower children's
aspirations to succeed. Furthermore, ethnic norms
do not take into consideration the complex reasons
why minority children on the average score lower
than Anglo American children on IQ tests. Ethnic
norms are potentially dangerous from the social
perspective because they provide a basis for
invidious comparisons between racial groups. The
tendency is to assume that lower scores are indica-
tive of lower potential, thereby contributing to the
self-fulfilling prophecy of lower expectations for
minority children and reinforcing the genetic-
inferiority argument advanced by Arthur Jensen
and others.

In addition, if test publishers and users are 
willing to establish ethnic norms, they should also 
establish norms based on sex differences. To take
into account both sex and all the ethnic subgroups
in the United States would require an almost
infinite set of norm tables. From the practical
point alone, this is absurd. One might wonder what
norms a publisher would use for a set of male/
female twins who had a Mexican father and a
Hungarian mother. 

The testing industry has also responded to 
criticism of conventional IQ tests by attempting to 
create culture-free tests. Such tests are difficult to
construct, and many question whether they
achieve their goal of being free of cultural bias.
Tests of mental ability and/or achievement attempt
to determine the ability of a child to manipulate
certain elements of a problem into a predetermined
solution. It is difficult to conceive of test elements
equally familiar to children of all ethnic or cultural
groups, especially when test developers are mem-
bers of a group themselves.

In a large number of frequently used IQ and
achievement tests, cultural influences on items
cause the tests to measure something other than
that for which they were designed. Thus, aside
from what many tests set out to measure, to a large
extent they also measure:

• Socialization. Certain test items are
actually measures of the child's family value
system. In tests marketed in the United
States, the referent value system is, generally,
that of the Anglo American middle class.

This characteristic is particularly evident
in the comprehension scale of one of the
major individually administered IQ tests. The
test presents questions very much like this
one provided by the publisher as a typical,
but not authentic, item from the test: "What
should you do if you see someone forget his
book when he leaves his seat in a restaurant?"
This type of question has little or nothing to
do with a child's ability to process, manipu-
late, and/or code information. The answers
depend almost exclusively on whether a child
has been socialized under the particular
ethical system implied by the question.

• Productivity or level of aspiration. Many
tests confuse what they hope to measure with
a measure of productivity or level of aspira-
tion. For example, in a large number of tests,
the child who produces the most responses
receives a higher score than the one who stops
responding after only a few attempts. The
assumption underlying this type of test is that
all subjects will produce as many responses as
they are able, in other words, that they all
have the same level of aspiration.

Timed tests also confuse the measure-
ment of aspiration. In timed tests, which con-
stitute the majority of published group tests,
the tester asks children to work quickly,
quietly, and efficiently. Little regard is given
to children who are not motivated to work in
that manner. For the purpose of boosting
statistical reliability, tests are constructed in 
such a way that children are asked a large 
number of questions which vary only a little 
in content. 

A similar problem involves tests which 
sequence items in order of increasing dif- 
ficulty. In these, children encounter in-
creasing levels of failure and frustration. For
those who start out fearfully, as do most chil-
dren unfamiliar with the social demands of
the school or test situation, the first indica-
tion of failure or difficulty discourages them
from continuing. 

• Experience or specific learning. Tests that
require answers of fact assume that all chil-
dren taking the test will have had about the
same exposure to the facts being tested. Any
number of examples involving vocabulary
bear out the spuriousness of this assumption.
It is impossible to determine whether
minority children miss a test item because
they have never been exposed to the word or
because they lack the capacity to understand
the word. Problems of this type are found in
virtually any test of mental ability which uses
a score on a vocabulary subtest to infer
ultimate capability.

One of the most widely used individually
administered intelligence tests is full of
examples of the importance of specific
experience on test results. For example, the
child is asked questions of vocabulary which
bear directly on past experience or exposure
to the words being tested.

Now let us consider the validity and utility of
the IQ score. Forgetting for a moment stan-
dardized achievement tests, the original justifica-
tion for the use of the IQ test was that the scores
statistically predict mental retardation and low
school achievement. Yet in 1971, sociologist Jane
Mercer found that of adults who scored below 79
on an individually administered IQ test (and who
would have been labeled mentally retarded had
they been schoolchildren), 84 percent had com-
pleted eight grades or more in school, 83 percent
had held a job, 80 percent were financially
independent, and almost 100 percent could do
their own shopping and travel alone. In other
words, even at the task for which experts agree the
IQ test is best suited, screening for mental retarda-
tion, the IQ measure probably has a dubious real-
life validity.

In addition to its traditional use as an indi- 
cator of mental retardation, many educators and 
politicians have come to consider the IQ test to be 
a useful instrument for teachers, school districts, 
and state and federal agencies. Indeed, many states 
mandate that districts administer IQ tests several 
times as a child goes through the school system. 
But do the results really help the teacher do a 
better job? 

Let us consider a typical example. A teacher
suspects that a child has a severe learning disability 
and asks the school psychologist to test the child. 
After the psychologist gives an individually admin- 
istered test in which the child scores an IQ of 87, 
the psychologist writes up an extensive report of
impressions of the child's performance and
potential. Upon receiving the report, the teacher
responds in surprise, "But I knew all that. I want 
to know how I can reach this child." Thus, neither 
psychologist nor teacher is any wiser despite con-
siderable time and expense administering and
evaluating the IQ test.

While few psychologists would agree that edu-
cational decisions affecting a child's life should be
made just on the basis of an IQ score, the fact
remains that such decisions are made by educators
who, through personal fiat, supported by state
mandate, ignore both individual subscale profiles
and psychologists' admonitions for the sake of
practical expediency. The result is, of course, a
form of default institutional racism.

Thus, while much of the controversy sur-
rounding IQ tests and minority children focuses on
whether the IQ model is a valid one, a more prac-
tical question concerns the general utility of the
information the test produces. In order to answer,
one must consider who is asking the question and
why. Within the educational system, there are
qualitative differences in the type of information
needed, depending on the source of the need. To a
large extent, much of the confusion surrounding
the issue of whether to test stems from failure to
consider these differences.

Several levels within the educational system
require information traditionally obtained through
IQ testing: the funding level, which involves federal
and state agencies; the local level, which involves
district personnel and school principals; and the
school level, which involves classroom teachers,
paraprofessionals, and parents.

Federal and state funding agencies expect IQ
tests to supply them with information concerning
statewide or districtwide needs for the purpose of
allocating funds and information concerning pro-
gram effectiveness. There would seem to be far
better ways of meeting the first need than trying to
infer specific needs from an omnibus assessment
based on so poorly understood a concept as IQ.
Assessment procedures which can evaluate whether
specific educational programs are needed in
specific areas such as science would be more useful.
Such procedures exist, and these allow direct
inference from test performance to program need.

The second need, that of knowing about the
effectiveness of particular programs, has become
particularly demanding recently in light of ac-
countability and evaluation/audit requirements. In
response, these federal and state agencies have
often mandated that IQ and standardized achieve-
ment tests be administered to evaluate programs.

Actually, program evaluations can be made
through a variety of procedures, none of which
necessarily has anything to do with IQ or stan-
dardized achievement tests. For example, a reason-
able assessment can be made by interviewing
administrators, teachers, parents, and children as to
their perceptions of program effectiveness and by
testing specific program objectives and reporting
changes in group scores without reference to indi-
vidual scores.

Local school district personnel require infor-
mation about the needs of children and the effec-
tiveness of programs in the same way as do the
federal and state agencies. However, since needs
assessments are usually conducted at the state
levels, local officers tend to rely on the state-
provided information rather than to conduct
expensive research on their own.

Ideally, evaluation of individual programs
should center around collection of data dealing
directly with program objectives and activities.
However, instruments of evaluation often have
little to do with the actual program; IQ or national-
ly normed achievement tests are used, providing
scores which often have little in the way of infor-
mation about effectiveness of individual programs
and program components.

The last to be considered in the educational
hierarchy are, unfortunately, classroom teachers
and what they need to assist the learner. How can
teachers translate numerical IQ scores into cur-
riculum or instructional prescriptions? This ques-
tion is particularly perplexing because teachers
cannot rely on absolute point differences on IQ
scores. For example, if a teacher wanted to know
what should be done differently for children with
scores of 92 and 100, the answer would have to be
"nothing" because these scores are functionally
equivalent. They are both within the "normal"
range, i.e., within one standard deviation of the
mean. However, when the same eight-point dif-
ference is between IQ scores of 84 and 92, there is
a different implication. The score of 84 is approx-
imately one standard deviation below the mean
and is, in some states, considered to indicate that a
child is in the "retarded" or "slow learner"
category. In this case, the eight points which
separate the 84 and 92 scores would necessitate
different recommendations for the children in-
volved.
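
The arithmetic behind this example can be sketched as follows. The mean of 100 and standard deviation of 16 are assumptions chosen because they make 84 fall one standard deviation below the mean, as the text describes; some tests use 15. The names `z_score` and `same_band` are hypothetical.

```python
MEAN = 100  # conventional IQ scale mean (assumption)
SD = 16     # assumed standard deviation; some tests use 15

def z_score(iq):
    """Distance from the mean in standard-deviation units."""
    return (iq - MEAN) / SD

def same_band(a, b):
    """True when both scores fall on the same side of the line one SD below the mean."""
    cut = MEAN - SD
    return (a > cut) == (b > cut)

# Two pairs of scores, each eight points apart.
print(z_score(92), z_score(100), same_band(92, 100))
print(z_score(84), z_score(92), same_band(84, 92))
```

Under these assumptions the first pair lands in the same band and the second pair does not, even though the point difference is identical. The label depends on where the administrative line is drawn, not on the size of the gap.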

In many cases, the same criticisms apply to
achievement tests that provide collapsed or sum-
mary achievement scores. What educational dis-
tinctions and decisions can teachers make about
children with reading grade equivalency scores of
3.2 versus 3.6 and 3.6 versus 4.6? Neither the IQ
score nor the collapsed achievement score provides
enough information on which to base sound daily
educational decisions.

These issues have brought us to consider an
alternative assessment model which derives from
the work of Jean Piaget. We have been working
with Juan Pascual-Leone of York University,
Toronto, in developing a neo-Piagetian procedure,
which has been tested with approximately 1,100
Mexican American and other children in four
Southwestern states. Children were tested using
standardized tests of school achievement, IQ, and
four Piaget-derived measures developed indi-
vidually and jointly by De Avila and Pascual-Leone
over the past 10 years.

The goals of this research were: 

1. To test interrelations among the four
neo-Piagetian measures in a sample of
primarily Mexican American children
who live in different areas and have
different socioeconomic backgrounds.

2. To examine the psychometric properties
of these neo-Piagetian measures.

3. To examine the relation between
developmental level as assessed by the
neo-Piagetian procedures and IQ as as-
sessed by standardized measures.

4. To examine sex differences in perfor-
mances on the tests.



Results of this research have shown that: 

1. These measures exhibit a developmental
progression of performance scores across
age in accordance with Piaget's theory of
cognitive development.

2. Performance of the primarily Mexican
American sample is developmentally ap-
propriate and within the limits of ex-
pected levels of cognitive development
for given chronological ages.

3. There are no meaningful differences be-
tween the sexes.

4. Scores of children taking the tests in
English, Spanish, or bilingually showed
no appreciable differences.

5. There were no ethnic group differences
on the neo-Piagetian measures of cogni-
tive development at the New Mexico
location, the only place where direct
ethnic group comparisons could be
made. There were, however, consistent
ethnic group differences on the IQ
measures (Otis-Lennon Mental Ability
Test) and on the achievement measure
(Comprehensive Tests of Basic Skills)
always in favor of Anglo Americans.

These results have several implications. First,
as this was a field study, further work is needed
with greater control over such variables as language
background, ethnicity, and achievement. With such
controls, the nature of the relationship between
neo-Piagetian measures and traditional measures of
capacity and achievement can be assessed with
greater precision. Second, results of this study
indicate that the relationship between cognitive
development and school achievement, especially of
Mexican American children, must be more closely
examined. Third, the failure to find a difference
between Mexican American and Anglo American
children on the neo-Piagetian measures leads us to
adopt the position that Mexican American children
develop cognitively the same as Anglo American
children. It appears, however, that cognitive
development in Mexican American children and
perhaps others is not in itself a sufficient condition
to engender a level of school achievement
equivalent to that of middle class children.

Failure of Mexican American children to 
achieve in school and to perform well on tradi- 
tional capacity and achievement measures must be 
attributed to reasons other than alleged cognitive 
inferiority. Some reasons for poor performance, we
feel, lie in the design characteristics of curriculum
and other classroom materials, language usage, and
the situational contexts or givens used in both test-
ing and presenting curriculum. Culturally biased in
favor of particular groups, they put all other chil-
dren at a distinct disadvantage.

While these findings are of importance in 
understanding the cognitive development of 
Mexican American children, the more basic ques-
tion remains: How can the classroom teacher use
the information provided by the neo-Piagetian
approach on a regular day-by-day basis?




In an attempt to generate test information
which directly fulfills informational/instructional
needs within the schools, we have designed a com- 
puterized system which deals with information 
needs of the three levels of school personnel dis- 
cussed previously. At the administrative level, this 
system provides group statistical data for program 
evaluation and needs assessment and, at the teacher 
level, provides classroom recommendations rather 
than scores. 

This system simultaneously takes into ac-
count achievement and developmental scores for
both the individual child and the child's referent
group. It thus becomes possible to determine all of
the possible test outcomes and, thereby, to design
individual computerized program prescriptions for
each child tested. Workshops are then held with
the teachers involved to discuss the implementa-
tion of these prescriptions. A copy of these recom-
mendations can also be sent to the home so that
parents are aware of what the teacher is trying to
accomplish with the child and can, with guidance
from the teacher, participate in the child's educa-
tion.

This system, called Program Assessment Pupil
Instruction (PAPI), was tested successfully in the
same four states where data were gathered for the 
above described research. 

It should be noted that the PAPI system is 
designed so that a child's peer or referent group
can be designated in numerous ways, such as grade,
sex, or program group.

Thus far we have tested the PAPI system by 
working directly with classroom teachers, by ex-
plaining the computer printouts, by listening to sug-
gestions, and by continuously refining our approach.






CRITICISMS OF STANDARDIZED TESTING

by Milton G. Holmen
Richard Docter



The case for objective assessment of educa-
tional achievement through standardized ability
testing is based upon the idea that we ought to try
our best to measure accurately what children are
able to do. Such information, it is argued, should
be of value to everyone genuinely concerned with
the continuing development and improvement of
educational practice. But despite these wholesome
goals, educational and psychological testing has
come in for a great deal of criticism, especially
during the past 20 years. Comment has included
allegations that testing is linked to thought-control
efforts; that there is manipulation and undue in-
fluence on school curriculums, especially at the
secondary level; and that tests promote an un-
warranted invasion of privacy. Criticisms have come
from civil rights spokespersons; from educators;
from the critics of education in America; from
sociologists, psychologists, philosophers; from
politicians, journalists, and public administrators.

Criticisms of testing were especially bountiful
in the years between 1955 and 1965. No single
focal point of discontent was identified; criticisms,
both major and minor, were hurled at testers in
schools, in industry and government, and in clinical
and research work. (The best summary of this
literature is a selected annotated bibliography
prepared for the Commission on Tests of the Col-
lege Entrance Examination Board. A report pre-
pared for this Commission independently catalogs
10 criticisms of tests. Some of these deal primarily
with tests of ability or achievement, but most
apply also to personality testing.)

In our book, Educational and Psychological
Testing, we attempt to examine the testing
industry and to offer a format to help evaluate the
adequacy of testing systems. But in this chapter
we have a more limited concern: our goal is to
offer a summary of the major criticisms pertaining
to standardized educational testing. Please keep in
mind that we are not here offering some kind of
indictment of this testing; rather, we hope this
identification of criticisms will contribute to the
responsible development of this important segment
of education.



Tests discriminate against some individuals. It 
has been strongly argued that some testing pro- 
grams have consistently failed to take into account 
differences in cultural background and in unique 
individual attributes. Such failure unquestionably 
influences test results and may, therefore, penalize
the testees. 

A major concern is whether tests developed 
primarily for use with Caucasian subjects can 
properly be administered to minority-group mem-
bers. Many of the latter may have educational and
cultural backgrounds markedly different from 
those of the subjects used in the standardization of 
any particular test. 

Employment-selection tests have especially 
been denounced by minority-group representatives 
as too often containing built-in bias which favors 
the middle-class white person and discriminates 
against the minority applicant. While respected 
testing professionals may disagree on the inter-
pretation of specific data purported to prove or
disprove this point, they agree that tests lacking in
job-related validity have no place in selection-and-
placement testing programs.

Tests predict imperfectly. No standardized 
tests are perfect predictors of future behavior. 
Even the most enthusiastic proponents of objective 
assessment techniques would insist that their ability 
to foretell behavior is highly dependent on such
factors as the individual(s) to be tested, the 
behavior to be predicted, the time over which pre- 
diction is to be attempted, and the criterion 
measures used to establish predictive effectiveness. 

But even with all these qualifications, critics 
of testing have come to the conclusion that many 
tests are weak and unsatisfactory devices which 
mislead naive test users and result in harm to those
tested. Many critics have just about given up on
tests, for they see them as falling far short of the
ideal applications envisioned by their creators and 
their publishers. 

The problem of test validation encompasses
many issues that go beyond establishment of 
certain formal psychometric properties which may 
be present to some extent in any test. The proper
use of tests must encompass a variety of
responsibilities independent of the attributes of
any particular test. We must not only ask whether
a test has been shown to possess some kind of
validity for a known group of subjects, but also 
must investigate many other questions bearing on 
the particular circumstances surrounding the 
application of the test. 

Test scores may be rigidly interpreted. Test
scores provide one opportunity to establish a data
base of individuals. Anyone interested in labeling
people can have a field day with test results. This
fact notwithstanding, the properly trained user of
tests is supposed to know that test scores are not
fixed measures, that they are estimates of human
attributes at best, and that they necessarily en-
compass various kinds of sampling errors.

But test scores are often applied in rigid and 
arbitrary ways. In schools this can result in assign-
ment of children to ability groupings based on
measures which may be indefensible. The quality
of professional practice associated with test usage
leaves much to be desired.

Tests may be assumed to measure innate
characteristics. Some critics of ability testing have
argued that tests provide scores that may be
naively interpreted as measures of innate character-
istics, such as "intelligence"; many harmful
consequences are said to flow from this miscon-
ception. It has occasionally been assumed that, if
tests were not available, people would not make
arbitrary classifications of individuals. Tests are
therefore condemned as antihumanistic and as
fostering a view of humankind that sees human
abilities as fixed or rigidly limited.

Even worse, some critics have reasoned that
tests influence individuals to conceive of humans in
categorical terms, such as "mentally retarded" or
"gifted." They conclude that thinking of this kind
is undesirable.

At first glance, this seems to be nothing more 
than a variation on the practice of making rigid use 
of test scores. The essential difference, however, as
expressed by some critics, is that not only do tests
foster the belief that one has fixed "intelligence"
based on innate characteristics, but also that the use
made of test scores depends heavily on such a 
belief. 

The kind of school program offered and the 
energy invested in preparing a youngster for the
future may be directly influenced by an educator's
belief that tests measure innate intelligence. The
egalitarian ethic in America frowns upon labeling
based on some arbitrary measurement supposed to
reflect innate characteristics.

Test scores may influence teacher expectation
regarding student potential. In their classic study,
Robert Rosenthal and Lenore Jacobson showed
that, when teachers' expectations regarding student
potentials were based on fictitious information
about the students' abilities, the actual achieve-
ment of students reflected these expectations.
Those who were expected to achieve less did
achieve less, and vice versa.

Critics of ability testing have argued with
considerable force that tests of "intelligence" have
highly undesirable consequences for student per-
formance because, at least in part, teachers tend to
relate to students differentially, according to their
supposed intelligence. Students who are singled out
as "gifted" or "low ability" are given different
assignments, rewards, and teachers, and they are
systematically taught what is expected of them.

There seems little argument that teachers'
expectations contribute to student performance. It
is less clear what factors shape teacher expecta-
tions. Test scores may be important in determining
differences among students for some teachers;
however, we need to know far more about the en-
tire matter of teacher expectancy, for many other
variables may in fact help to determine their
attitudes.

Tests have a harmful effect on the shaping of
cognitive styles. The widespread use of multiple-
choice test items, matching items, and other test
components with a single correct answer is said by
some critics of testing to contribute to undesirable
styles of thinking. Some claim that the young stu-
dent is carefully taught that all problems must have
a right or wrong answer, and thus the student is led
to think in this manner about all questions.

Tests shape school curriculums and restrict 
educational change. When teachers know that the 
evaluation of their students will be based on a 
particular kind of test of some more or less predict- 
able content, they make extensive efforts to assist 
their students to perform well on these tests. The 
proponents of statewide testing programs would 
probably argue that this is exactly what they have 
in mind: that teachers ought to be encouraged to
cover material which their colleagues consider
essential. "What's wrong with this?" they ask.

Critics of testing say that experimentation 
with new ways of teaching, the introduction of 
new subject matter, and the whole process of indi- 
vidualizing instruction in terms of the needs and 
interests of individual students are hamstrung by a
slavish adherence to standardized achievement
testing. The question seems to come down to find-
ing an acceptable balance between the need to
know what has been learned during a given period
of time and the encouragement of innovation,
change, and experimentation in the classroom. 

Tests distort the individual's self-concept and
level of aspiration. Of all the criticisms of tests, one
of the most penetrating and difficult to dismiss is
that young persons may generalize from test results
and draw conclusions about themselves which are
not warranted or intended. For example, consider
the teenage boy who is struggling to establish a
more positive and more realistic self-concept. How
helpful is it for him to be shown his low test scores,
which may make him conclude that he is far less
capable than his classmates?

How many high school students have received 
brief and inappropriate counseling recommenda- 
tions, usually based in part on test results, and have 
concluded from these recommendations that they 
are not "college material"? One large school dis-
trict, for example, regularly presents junior high
school students with test result summaries printed
on cards that the students take home to their
parents. These cards offer a lucid and easily under-
standable summary of what the various achieve-
ment and aptitude scores mean. Although the
intent is to make information available to parents,
there are obviously risks in terms of shaping the
attitudes of students toward themselves.

In our view, the proper handling of test re-
sults calls neither for a strategy of silence and
secrecy nor for open distribution of data without
discussion, clarification, and interpretation of
meanings.

Tests select homogeneous educational
groups. A common procedure in organizing a
school is to assign students to classes on the basis
of estimates of learning ability. Very often these
estimates are based on ability testing. It is a short
step to conclude that tests have determined the
organizational style of schools, and it may surely
be argued that tests do indeed contribute to the
way in which students are assigned.

Critics of the ability-track system, as this
arrangement is often called, frequently see educa-
tional testing as the bad guy. But, were no test data
available, an educational administrator dedicated
to ability-track grouping could find numerous
criteria, such as grades, teachers' ratings of ability,
and so forth, for making these assignments.

Concerns about homogeneous grouping in 
schools have acquired strength with recent research 
which suggests that this allocation procedure tends 
to do more harm to the low groups than can be 
justified. The proponents of heterogeneous assign- 
ment to classes argue that children with lower 
ability need the stimulation and the role models
provided by higher-ability students if they are to
achieve as much as they possibly can. 

Contemporary approaches to school organiza-
tion stress the importance of providing a program
of individual instruction for each child, regardless
of the range of competences within a class. Educa-
tors are now stressing the positive influences of
heterogeneous grouping, with the result that the
track system is generally thought to be on the way
out. But for the parents of children who are
assigned to low groups, the track system is an un-
pleasant reality based primarily on test results.
Hence, since tests are often painted as the villain in
the situation, it is assumed that banning tests will
eliminate the track system.

However, with regard to a school district set 
on the perpetuation of homogeneous ability group- 
ing, the problem is not so much one of testing or 
not testing, but rather one of adherence to a ques- 
tionable concept of educational organization. 

Tests invade privacy. School attendance is
mandatory for young children. Once in school, the 
children are generally required to participate in 
activities, including testing, which some parents 
consider to be invasions of privacy. 

Certainly few would argue against allowing 
schools to give tests to determine what a student
has learned in some course of study, but should 
schools be allowed to require students to take 
intelligence tests? What good is such information 
to a school? Can data from some tests be used to
the disadvantage of students without their knowl- 
edge that such information even exists? How can 
the line be more clearly established between infor-
mation that a school requires to help reach a
legitimate decision and information that it has no 
business acquiring in the first place? The right to 
privacy is precious to the citizens of a free society; 
only when there is compelling justification should 
tests invading privacy be used. 

At the heart of the criticisms about tests and 
testing programs is one fact that is likely to help 
perpetuate at least some of the criticism: Tests are 
often used as tools for the allocation of limited 
resources or opportunities. Put another way, edu- 
cational and psychological tests are frequently 
designed to measure differences among individuals 
so that one person receives a reward or privilege 
which another is then denied. 

For example, we see this in the assignment of
elementary school children to classes for the gifted
or in the selection of students for college admission
or for advanced professional study. Tests, there-
fore, are likely to stir strong emotions, for they
serve in many different ways as gatekeepers, open-
ing and closing pathways of human opportunity.

Are tests necessarily the kind of gatekeepers
we want? This is a question involving individual
values, organizational goals, and, increasingly, laws
and regulations designed to assure equal access to
educational and employment opportunities. One
thing is certain: Tests are no longer granted any
immunity or magical status, nor are they assumed to
be good simply because of their objectivity or
psychometric purity. The lawmaker as well as the
citizen on the street has a skeptical eye on educa-
tional and psychological testing.

There have been too many serious lapses of 
professional judgment, not only by those who are
using tests without the proper qualifications, but
also by professionals who should know better. And
minority groups' intense concern for fair play rela-
tive to testing is not going to evaporate; indeed, it
will probably be expressed with increasing 
vehemence. 

However, while we may anticipate continued 
criticism of tests for a variety of reasons, testing 
programs that measure up to high professional 
standards and can be shown to make constructive
contributions to human assessment may well be
regarded as beneficial by most people.



PROBLEMS IN USING PUPIL OUTCOMES FOR TEACHER EVALUATION 

by Robert S. Soar
Ruth M. Soar



During the past few years there has been
mounting pressure for measuring the outcomes of
education, with movement toward holding the
teacher, the school, and the school system account-
able for producing the student learning expected
by society. Decreasing enrollments, tighter
budgets, and a general trend toward cost effective-
ness have added to the pressure.

Measuring pupil achievement increasingly has
been proposed as a way of assessing the effective-
ness of teaching and, in fact, has been mandated by
a number of states. This approach is superficially
reasonable and attractive, but it is fraught with
problems which have not been generally
recognized.

H. L. Mencken once commented, "There's
always a well-known solution to every human
problem: neat, plausible and wrong." The use of
pupil achievement as a way of evaluating the teach-
er, the school, or the school system embodies this
misleading simplicity. The solution seems so
straightforward: If the job of teachers is to
promote learning in pupils, then it seems reason-
able to evaluate them in terms of the amount of
learning they produce in their pupils.

The parallel with the industrial setting is
clear: If the job of a worker is to assemble relays,
then it seems reasonable to count the number of
relays the worker assembles and pay him or her
accordingly. But in applying this procedure to
teaching, a number of problems emerge that have
not been widely recognized. The relay assembler
receives parts which are identical (at least within
very close limits) on which he or she performs a
prescribed set of operations, also identical. Then
each completed unit leaves the assembler, again
almost identical to the others.

But none of this is true for teachers. Pupils
appear in the classroom differing in ability, level of
achievement, home background, interest, motiva-
tion, age, and in numerous other ways. Teachers
must recognize these differences as they strive to
help individual pupils grow toward their own
potentials. Consequently, the teaching process will
differ from pupil to pupil. If the teacher has been
successful, each pupil will have improved educa-
tionally when he or she leaves the classroom but
each will probably be no more like the others than
when the year began.

A major dimension, then, of the problem of
evaluating teachers in terms of pupil outcomes is
the recognition that what goes on in the classroom
is not the only, or the most powerful, influence on
where a pupil stands in achievement at the end of
the year.

Research has shown that the differences
pupils bring with them when they enter the class-
room have a significant influence on achievement.
Entry-level ability (pretest or fall score) and socio-
economic status are major determinants of what a
pupil's standing will be at the end of the school
year. These influences probably are more widely
accepted than any other, but they are highly inter-
related so that one overlaps the other. In practice
they cannot be effectively separated.

The fact that IQ and achievement scores in
the fall are highly related to spring achievement
scores is widely accepted but seldom documented.
In a study of SI fifth-grade classes, R. S. Soar and
R. M. Soar found correlations between class
averages (means) for fall IQ and spring achievement
ranging from +.85 to +.90, and correlations be-
tween fall achievement and spring achievement
ranging from .75 to .85. So the evidence is that as
much as 80 percent of the variation in class aver-
ages for pupil achievement at the end of the year
can be accounted for by pupil characteristics which
existed at the beginning of the year, characteristics
over which the teacher has no control.
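
The step from a correlation to a "percent of the variation" can be checked with simple arithmetic: the proportion of variance accounted for is the square of the correlation. The following is a minimal sketch, not part of the original study; the function name is hypothetical and the values are the correlations quoted above.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance in one measure accounted for by another,
    given the correlation r between them (simple linear relationship assumed)."""
    return r * r


# Correlations quoted in the text for class averages.
for r in (0.75, 0.85, 0.90):
    share = variance_explained(r)
    print(f"r = {r:+.2f} accounts for about {share:.0%} of the variation")
```

On this reading, the "as much as 80 percent" figure corresponds to the upper end of the reported range, a correlation of about +.90.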

The most extensive data on the influence of
socio-economic status on pupil achievement were
presented in the Coleman Report, and more re-
cently and more widely re-analyzed by F. Mosteller
and D. P. Moynihan and G. W. Mayeske and
others. The studies show that as much as 80 per-
cent of the variation in pupil achievement across
schools (equal to a correlation of about +.90) can
be accounted for by these factors.

Beyond these major influences there are 
others which help account for differences in pupil 
achievement and which should be considered.
Although the research on family attitudes and
support for learning in the home is not as extensive
as that for pupil ability (pretest) and social status,
it is consistent in indicating relationships between
the educational values held by parents and their
children's achievement in school. M. Garber and W.
B. Ware found a relation of +.47 between achieve-
ment and a combined measure of support for learn-
ing in the home for a group of Black and Spanish-
American children. All students in the sample met
federal poverty guidelines, so that socio-economic
status as usually measured was, in effect, held con-
stant. The same authors cite similar findings from
other studies. 

Peer group attitude, although again the re- 
search is not extensive, has been identified as 
another important factor which can either support 
or hinder a pupil's achievement.

Since there is compelling evidence that a num-
ber of influences over which the teacher has no
control have powerful effects on pupil achieve-
ment, it cannot be expected that a teacher will
have consistent results with successive groups of
pupils. That is, the teacher will not be equally
effective in producing growth with all groups be-
cause groups differ so widely. Studies by Barak
Rosenshine and J. E. Brophy, for example,
show that on the average only about 10 to 15 per-
cent of the variation in achievement from group to
group reflects the stable influence of the teacher,
as shown by a median correlation in the low .30's.

As D. M. Medley has pointed out, and as
commonly accepted methods of estimating
reliability show, data from about twenty classes 
would be required for making reliable decisions 
about individual teachers. Given this requirement 
necessitating collection of such large amounts of 
data, using the measurement of pupil achievement 
as a way to evaluate teachers is impractical as well 
as invalid. 
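
The source does not say which reliability method lies behind the figure of twenty classes. One commonly used candidate is the Spearman-Brown formula, which estimates the reliability of an average over k comparable measurements from the correlation between any two of them. The sketch below is an illustration under stated assumptions: a class-to-class correlation of .32 (taken as "the low .30's") and a target reliability of .90 for decisions about individuals. Both values and both function names are assumptions, not figures from Medley.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Estimated reliability of the mean of k comparable measurements,
    given the correlation r_single between any two single measurements."""
    return k * r_single / (1 + (k - 1) * r_single)


def classes_needed(r_single: float, target: float) -> int:
    """Smallest number of classes whose averaged results reach the target reliability."""
    k = 1
    while spearman_brown(r_single, k) < target:
        k += 1
    return k


# Assumed inputs: r = .32 between a teacher's results with two different classes,
# and a reliability of .90 as the standard for decisions about individuals.
k = classes_needed(0.32, 0.90)
print(f"classes needed: {k}, reliability reached: {spearman_brown(0.32, k):.3f}")
```

Under these assumptions the formula gives a requirement close to the "about twenty classes" stated above; a different target reliability would change the number.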

What these findings seem to indicate is that
the education of the pupil is dependent on many 
conditions in the society, not on the school alone. 
When the time the pupil spends in the classroom is 
compared with the time spent under other influ-
ences, and when the degree of influence or control
the teacher can exercise is compared with the
power of other influences, the limited effect of the 
teacher is not surprising. 

That influences other than the teacher
make a major difference in how much the child
learns is not to say that the role of the teacher is
unimportant. The teacher is the only formal, insti-
tutionalized input which society makes for the
education of the child and the transmission of an
established curriculum. And much of what the
teacher does that contributes constructively to the
child's future abilities, successes, and satisfactions
may not be measured by currently common
achievement instruments. It does say, however,
that the influence of teachers is limited and that
teachers are most effective when they have the
support of other elements in the society.

This whole constellation of other influences is
usually not given consideration when measures of
pupil achievement are proposed as the basis for
evaluating teachers. It is reasonable that these in-
fluences are strong, since they accumulate over the
life of the pupil. It is obvious, then, that pupil
standing at the end of any school year is a com-
pletely inadequate and even misleading measure of
the effectiveness of the teacher or the school. Yet
the results of such achievement standings are
frequently published by school or by school sys-
tem.

"Achievement," which is the most frequently
used measure of student learning outcomes, usually
refers to the amount of knowledge a pupil
possesses at a given point, his or her "standing."
The influences cited above show a strong relation 
to achievement as used in this sense. 

An alternative to measuring achievement
standing is to measure change in achievement from
the beginning to the end of the year. When this is
done, the influences cited are still likely to have an
effect, although to a lesser degree, since change 
reflects their influence for a shorter period of time. 

Although this alternative is appealing as 
another way of evaluating teaching, it raises still 
other problems. In a classic volume on the prob-
lems of measuring change, C. Bereiter com-
mented:

Although it is commonplace for research to be
stymied by some difficulty in experimental method-
ology, there are really not many instances in the
behavioral sciences of promising questions going un-
researched because of deficiencies in statistical
methodology. Questions dealing with psychological
change may well constitute the most important ex-
ceptions. It is only in relation to such questions that
the writer has ever heard colleagues admit to having
abandoned major research objectives solely because
the statistical problem seemed to be insurmountable.






If the fall score is simply subtracted from the
spring score so as to obtain a measure of net
change, a new set of subtle but difficult problems
is created. An illustration may serve to identify
some of them. Figure 1 presents fictitious data
from a group of pupils for whom measures of IQ
from two forms of a test have been obtained 10
days apart. The initial IQ's are plotted on the
baseline and the second IQ's on the vertical axis.
Any point in the area outlined by the ellipse
represents simultaneously the IQ of a pupil on each
of the testings, and the high and low 10 percent of
the pupils at each of the two times have been indi-
cated by shading and cross-hatching.

It is clear that the pupils who were in an 
extreme group on the first test were not, for the
most part, in an extreme group on the second test. 
The blackened areas represent the small number of 
pupils who were extreme on both occasions. 

At the upper right, the area is small because
the pupils who make the highest scores at any test-
ing are likely to do so on two bases: (1) they are
bright (have high verbal skills), and (2) they are
lucky (that is, they happen to make good guesses
on a few items for which they aren't sure of the 
answer, or the items on this test just happen to be 
ones for which they know the answers). But they 
are not likely to be lucky consistently when 
another form of the test is given, and so on another 
testing their scores are likely to be lower. Opposite 
influences will affect pupils at the lower left end of 
the ellipse.

To put it another way, if the cutting point for
the top 10 percent is an IQ of 120, there will be a
number of pupils with true IQ's close to 120 who
will sometimes be above that score on a series of
tests and sometimes below it, depending on chance
factors. So some fraction of pupils above 120 on
the first test will fall below it on the second.
Similarly, some of the pupils scoring below 80 on
the first test will be above it on a second.

In both cases, extreme pupils have regressed,
or moved, toward the mean. This regression effect
can be expected whenever prediction is less than
perfect; and the extent of the movement will
depend on the inaccuracy of the prediction. With
most psychological or educational predictions, the
regression involved is considerable and may make
up a significant proportion of the total range of
scores.
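
The size of the expected movement can be sketched with the standard linear prediction of a second score from a first. This is an illustration, not a calculation from the source: it assumes both testings have the same mean and spread, and the correlation of .80 between the two forms is a hypothetical value chosen only to show the arithmetic.

```python
def expected_retest(first_score: float, mean: float = 100.0, r: float = 0.80) -> float:
    """Expected second score given a first score, assuming a linear relation,
    equal means and standard deviations on both testings, and correlation r.
    The defaults (mean 100, r = .80) are illustrative assumptions."""
    return mean + r * (first_score - mean)


for first in (80.0, 100.0, 120.0):
    second = expected_retest(first)
    print(f"first IQ {first:.0f}: expected second IQ {second:.1f}")
```

The lower the correlation between the two testings, the farther an extreme first score is expected to move back toward the mean; with a perfect correlation there would be no movement at all.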



The point to be stressed from this example
has important consequences: Since pupils who
were in the bottom 10 percent the first time were
not, for the most part, in that group the second
time, they must have moved upward. Similarly, the
pupils in the top group must have moved down-
ward. That is, there is a negative relationship
between initial standing and the direction in which
change is most likely.

As an example of this effect, the pupils who
stand highest on an achievement measure at the
beginning of the school year will probably show
little if any increase in score at the end of the year,
and may even show a decline. On the other hand,
pupils who score lowest at the beginning of the
year will probably show considerable increase.
Educators have sometimes been misled by this
effect and have assumed that their programs were
more functional for low achieving pupils than for
high achieving pupils, when in reality all that was
involved was the regression effect (the statistical
tendency for scores to move toward the average).
Similarly, a group of pupils placed in a remedial
program because they stand low on a pretest can
be expected to show considerable improvement;
but again the improvement may be spurious, as a
consequence of the regression effect.

This problem creates real difficulties if pupils
are tracked on the basis of fall scores and teachers
are evaluated on the basis of change in achievement
of their pupils. For example, assume that pupils are
tested in reading in the fall and the lowest third are
put in Ms. Jones' class, the middle third in Mr.
Smith's class, and the highest third in Mrs.
Williams' class. We can anticipate that at the end of
the year Ms. Jones' class will show much improve-
ment and Mr. Smith's will show modest gain, but
Mrs. Williams will be fortunate if her pupils show
any growth at all. The problem is that the gain the
pupils show is materially affected by regression
effect, so to evaluate the teacher on the basis of
pupil gain would be manifestly unfair.
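
The three-class example can be imitated with a small simulation. The sketch below is purely illustrative and every number in it is an assumption: each simulated pupil has a stable ability, each test adds independent chance error, and every pupil truly gains the same 5 points regardless of teacher. Pupils are then tracked into thirds on the fall score, as in the example.

```python
import random
from statistics import mean


def simulate_tracked_gains(n_pupils: int = 3000, true_gain: float = 5.0, seed: int = 0) -> dict:
    """Mean observed fall-to-spring gain for pupils tracked into thirds by fall score.
    All parameters are hypothetical; every pupil has the same true gain."""
    rng = random.Random(seed)
    pupils = []
    for _ in range(n_pupils):
        ability = rng.gauss(100.0, 10.0)                    # stable pupil characteristic
        fall = ability + rng.gauss(0.0, 6.0)                # fall score with chance error
        spring = ability + true_gain + rng.gauss(0.0, 6.0)  # same true gain for everyone
        pupils.append((fall, spring))

    pupils.sort(key=lambda p: p[0])                         # track on the fall score only
    third = n_pupils // 3
    tracks = {
        "low": pupils[:third],
        "middle": pupils[third:2 * third],
        "high": pupils[2 * third:],
    }
    return {name: mean(s - f for f, s in group) for name, group in tracks.items()}


gains = simulate_tracked_gains()
for name, gain in gains.items():
    print(f"{name:>6} track: mean observed gain = {gain:.1f}")
```

Although no teacher effect of any kind is built into the simulation, the low track shows the largest observed gain and the high track the smallest, which is the pattern the example describes.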

There are statistical procedures for attempting
to eliminate this effect, but as C. Bereiter com-
mented, it is impossible to be certain that appro-
priate adjustments have been made; and the
expertise to do even the best that can be done with
the problem is not widespread. And, of course, all
the out-of-school influences on achievement
standing discussed earlier also influence gain,
although to a lesser degree. So it is clearly inappro-
priate to use pupil change as a way of evaluating
teachers where a teacher may suffer as a con-
sequence of the error involved.

[Figure 1. An Illustration of Regression Effect. Horizontal axis: First IQ Measure, Low to High; vertical axis: Second IQ Measure, Low to High.]

A procedure for evaluating teachers which
attempts to bypass the problems of change is the
performance test or the evaluative teaching unit.
In it, the teacher teaches a prescribed brief unit
(sometimes as little as a few minutes or as much as
two weeks) and pupil knowledge is then tested.
The attempt is made to minimize the problems of
measuring gain by teaching material in which
pupils should have little or no preknowledge, so
that all presumably start at the same level. But the
other problems of using pupil achievement to
evaluate teachers still apply. In addition, there are
questions of whether teaching material which does
not have to be integrated into previous knowledge
requires the same skills as the usual teaching setting
and whether such short-term learning generalizes to
long-term learning. There is the final difficulty that
the performance of teachers on a unit of a few
minutes does not predict their performance on a
two-week unit. Assuming that either can be used
to predict year-long performance, then, seems risky.

Even if the measurement of standing or gain in
achievement were a satisfactory way of evaluating
teachers, there is still the problem of selecting the
objectives to be measured.

Although subject matter achievement has
been the primary focus of the discussion thus far,
it is clear that schools are charged with and have
accepted some degree of responsibility for many
other kinds of pupil growth. Over a long period
schools have given attention to the social develop-
ment and the moral values of pupils. And a broad
view of the relationship between school and
society suggests that when a problem emerges in
the society, one of the first steps is likely to
involve the school in solving the problem. Traffic
problems led to driver education; a concern for the
loyalty of government employees led first to a ban
on teaching about communism in the schools and
later to the requirement that it be taught; problems
of drug abuse have led to drug abuse education in
the schools; concern about sexual attitudes has led
to sex education; concern for occupational choice
has led to career education in the schools; and
when concern for segregation of the races became
pressing for the society, the first and the major
attempt to deal with the problem was delegated to
the schools. To evaluate teachers and schools solely
on the basis of the subject matter gains made by
pupils grossly underrepresents the broad range of
objectives for which teachers and schools have
been given some degree of responsibility. Yet for
many of these objectives there are no measures
which are immediately, for some even remotely,
available.

Even within the subject-matter realm there
are problems which are largely ignored. One of
these problems is the need to distinguish complex
achievement growth from simple growth and to
provide appropriate measurement for each.
Memorization of facts (rote memory) falls at the
simplest level and complicated problem-solving,
abstracting, and generalizing fall at the most com-
plex level; the difference is between retrieving in-
formation (memory) and processing information in
its varying degrees of complexity. There is some
evidence from a number of studies that the teach-
ing behaviors which are associated with greatest
growth in simple tasks are different from those
which are associated with greatest growth in com-
plex tasks.15,16,17,18

Most studies of pupil achievement fail to
make this distinction; and the current stress on
criterion-referenced measurement, emphasizing
small-step learning, seems likely to focus on simple
kinds of learning. Measures of complex learning are
slow and difficult to construct, in contrast to
measures of simple learning, which can be more
easily and quickly developed. Evaluating all subject
matter at all grade levels would almost certainly
require the construction of many new measures
which would likely emphasize simple kinds of
achievement, given the ease with which they can be
constructed and the emphasis on criterion-
referenced measurement. If teachers were to be
evaluated on the basis of pupil achievement, then it
seems likely that the teacher who emphasizes
simple learning would be evaluated more positively
than the teacher who emphasizes complex learning.
This would be an unfortunate result.

A further problem related to the difficulty of
measuring complex achievement growth is the like-
lihood that some highly valued objectives grow too
slowly to show change within a school year:
objectives such as complex problem-solving skills,
citizenship, attitudes, learning to get along well
with others, and creative expression. On the other
hand, it seems likely that measures of short-term
learning would tend to emphasize simpler kinds of
learning.

A description of an application of account-
ability in England a century ago makes one of the
problems clear. In that setting, teachers were
evaluated on the number of their pupils who
attained the minimum level of achievement ex-
pected for the particular grade. The result was that
teachers concentrated their efforts at the minimum
level of proficiency, with a consequent lowering of
the quality of instruction.

Another problem of serious consequence in
the use of pupil measures is raised by the OEO
study of performance contracting, which found
that the superior achievement of performance-
contracting programs disappeared when the teach-
ing was controlled to eliminate the possibility of
teaching the test. It seems clear that, in a setting
in which financial return follows from pupil
achievement, teaching the test is likely to occur at
least a portion of the time. This is a very reason-
able finding and one which is well known, even in
cases where a financial return is not involved:
teaching to the Regents Examination, for example.

A final problem is the possibility of bias if the
teacher is the test administrator. Even outside test
administrators have difficulty not helping pupils.
Where a teacher is affected personally, it seems
possible that his or her behavior might be in-
fluenced, even though unconsciously. This problem
could be dealt with by using only specially trained
test administrators, but this could be very costly.

When all these problems in the use of pupil
achievement for teacher evaluation are considered,
they become overwhelming. The influence of the
teacher is minor compared to the out-of-the-class-
room influences: pupil ability, previous knowl-
edge, the home, the peer group, motivation, and
others. What the pupil brings to the classroom in
this respect is clearly a much stronger determinant
of where he or she will stand at the end of the year
than anything that may have been done in the class-
room. Influences on the development of future
achievement measures seem likely to limit them to
relatively simple measures for some time to come.
Tests available for measuring the other objectives
for which the teacher is to some degree responsible
are relatively few. In addition to these problems,
there are statistical difficulties in the measurement
of change which are extremely serious if not dis-
abling. They are still further exacerbated by the
likely problems of teaching the test, of the teacher
giving attention primarily to a small portion of the
students, and of obtaining valid measurement in
the classroom.

Taken all in all, this is an imposing array of
difficulties, most of which have gone unrecognized 
when it is proposed that teachers be evaluated by 
measuring the outcomes of their pupils. 






TEACHER-MADE TESTS: AN ALTERNATIVE TO
STANDARDIZED TESTS

by Frances Quinto



Developing tests for classroom use is a routine
activity for most teachers. These tests, a form of
the criterion-referenced type, serve the needs of
both the student and teacher. They disclose where
the student stands in relation to classroom
objectives and guide the teacher in providing the
student with appropriate help. Because of the
information that tests can offer, they should be
developed carefully rather than "off-the-top-
of-the-head."

What are some elements that can make teach-
er-developed tests effective? How can teachers be
certain that a test will reveal the kinds of informa-
tion they want?

Some teachers and test developers have found 
the following procedures to be of value: 

The teacher should decide the purpose of
giving the test and know how the results will be
used. A teacher can give a test to determine a
group's strengths and weaknesses or to measure a
class's or an individual student's knowledge of
subject matter. The teacher can give the test upon
first meeting a class, before introducing a unit of
study, or after completing a unit or course. A test
can be a reinforcement activity or an instructional
device, or it can be the "every-Friday" test which
helps students and instructor to monitor progress
in an orderly fashion. It can help the teacher
identify the students who need special help and the
areas in which they need the help. It can also be a
means of observing special talents.

A first step in preparing a test is listing the
kinds of information to elicit and then deciding on
the best format for getting that information. The
format can be essay, problem solving, computa- 
tional, application of information in new situa- 
tions, multiple-choice, or a combination of these. 
The format of a test is determined by the subject 
area, the kinds of information needed, and the 
amount of time allotted for the test. The teacher 
should decide what number or percentage of ques- 
tions to put in each category. 

Multiple-choice questions are difficult to
develop, so they take the most time prior to
administering the test. Essay questions take less
time to develop, but much more time to evaluate.

Just as variety is necessary in classroom
activity, so it is necessary in tests. Therefore, teach-
ers will want to include some items that students
can work through quickly and others that they will
work on for somewhat longer periods of time.

If reading comprehension or other reading 
abilities are not being tested, then written ques- 
tions and problems, including directions, should be 
easy enough for all students to understand, or else 
teachers should give directions orally.

Generally teachers like the challenge of creat-
ing new problems. Good tests are not simple to
construct, however. Teachers should save effective
items or exercises from year to year. In that case,
they should review the items before using them to
be sure that all the areas they deal with have been
covered in class. Whether test items are newly con-
structed or taken from a previous test, the person
who developed the test or another instructor
should take the test to be sure it is fair in presenta-
tion and content.

Test items may be confusing or ambiguous;
reviewing the corrected tests will help teachers to 
discover weaknesses. 

Teachers should not construct questions to
stump, catch, or confuse, but should state them as
clearly as possible. In a multiple-choice sequence,
the right and wrong answers should not be too
close in meaning. A "distractor" (the wrong
multiple-choice answer) should not be the correct
answer to another question, because this may con-
fuse students. Tests should help students clarify
their thinking, not confuse it.

For essay questions, teachers would be wise
to devise a key or scoring guide that they can use
for both commenting and assessing. They can
assign weights to such factors as organization of
topic, description, and grammar. They should use
the key consistently. Teachers of subjects like
science or sociology must be consistent, too, in the
weight they give punctuation and grammar in
evaluating their tests.
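
The weighting described above can be sketched in a few lines of Python. This is only an illustration: the factor names follow the paragraph, but the particular weights and the 0-to-5 rating scale are assumptions, and a teacher would substitute a key of his or her own.

```python
# Hypothetical weights for the factors named above; they sum to 1.
WEIGHTS = {"organization": 0.5, "description": 0.3, "grammar": 0.2}


def essay_score(ratings, weights=WEIGHTS, scale_max=5):
    """Return a 0-100 score from per-factor ratings on a 0..scale_max scale."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the factors in the key")
    weighted = sum(weights[factor] * ratings[factor] for factor in weights)
    return round(100 * weighted / scale_max, 1)


# One essay rated on each factor with the same key used for every student.
print(essay_score({"organization": 4, "description": 3, "grammar": 5}))
```

Because the key is fixed in one place, every essay is scored with the same weights, which is the consistency the paragraph asks for.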






True-false tests, which supposedly measure
students' command of facts, are the simplest to
construct. They should not be used as a major basis
for judgments about progress because chance and
guessing affect their results. Teachers have argued
that some credit should be given for items not
answered on true-false tests, since these can give
the teacher a truer picture of the group's knowl-
edge.
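
How much chance can affect a true-false score can be worked out with the binomial distribution. The sketch below is an illustration only; the 20-item test length and the passing mark of 14 are assumptions, not figures from the text, and it supposes a student who guesses every item with an even chance of being right.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """Probability of k or more correct answers out of n by guessing alone."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_items = 20   # hypothetical test length
passing = 14   # hypothetical passing mark (70 percent)

print(f"Expected score from guessing: {n_items * 0.5:.0f} of {n_items}")
print(f"Chance of {passing}+ correct by guessing: {p_at_least(passing, n_items):.3f}")
```

Under these assumptions a pure guesser averages half the items correct, which is why a true-false score alone is a weak basis for judging progress.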

Here are additional suggestions that can be
useful in constructing tests: 

• In a true-false test, about one-fourth of
the items should be false. Students are
afraid to trust their own judgment when
too many are false.

• Items should be arranged from easy to 
hard so that the beginning of the test 
will give students confidence. 

• The first and last items of a test should 
concern an obvious, mainstream topic so 
that students will leave the test feeling 
satisfied. 






• Questions should be worded in the 
positive, and positive and negative ques- 
tions should not be combined in the 
same section of the test. After a series of
positive questions, students have trouble 
answering those worded in the negative 
because the mental shift is tricky. 



In an effort to depart from the old-fashioned
true-false, multiple-choice exercises, teachers may
want to prepare items that stretch the students'
imagination. Teachers can prepare exercises that
allow students to estimate "how far," "how
many," "what kinds of," and the like. Students
enjoy and can benefit from exercises which allow
them to make inferences, such as "What effects
will be felt in the community as a result of (such
and such) court case (decision)?"

Creativity in test-making is an art. The art-
fulness does improve with practice, and the im-
provement can benefit student and teacher
alike.






AN ALTERNATIVE TO BLANKET STANDARDIZED TESTING 

by Richard J. Stiggins



It is common practice among some public
school districts to have a committee of teachers
and administrators annually review the district test-
ing procedures. The committee usually discusses
what standardized achievement test battery and/or
aptitude or intelligence tests should be used at
what grade levels. And the result is generally a
rubber stamp on previously used procedures.

Recently, however, a number of innovations
in testing procedures have emerged which may
make rubber stamping inappropriate. They include
the use of sampling procedures and such innova-
tions as criterion- and domain-referenced testing.

For a number of reasons, however, these
innovations are not frequently part of the test re-
view committee's deliberations. One reason may be
the committee members' limited knowledge about
testing, which is a highly technical subject and does
not lend itself easily to simple explanation, under-
standing, and application.

The institutionalization of testing procedures
has also contributed to the lack of knowledge and
applications of innovation in educational testing.
Or as Samuel Superintendent might put it, "The
board of education will never let us give up
standardized testing."

A third and final reason for the lack of impact
of educational measurement innovation on public
education is the minor role that tests play in edu-
cational decision making. Because of their general
nature, standardized tests are limited in their
ability to contribute to district, school, or class-
room level decisions. Many educators may
recognize this fact and yet continue to give the
tests because their use satisfies the board and the
public.

In brief, a lack of technical knowledge of test-
ing has given rise to the three factors stated above,
each of which in turn prevents the gaining of new
knowledge of testing. The result is a closed system
of development in educational testing which resists
the implementation of practices and procedures
common to other fields.

One such practice is sampling, a procedure for
increasing the efficiency of data collection that is
gaining prominence in educational testing as a re-
sult of recent large-scale testing programs, such as
the National Assessment of Educational Progress.

Random sampling is a statistical procedure 
which allows such social scientists as Gallup,
Harris, and other pollsters to draw general conclu-
sions about the attitudes of an entire population
on the basis of a very few scientifically selected
respondents. Survey participants are randomly 
selected to be representative of larger groups, thus 
allowing for efficient, less expensive, and quite 
accurate conclusions. 

And so it can be with achievement test scores.
In situations where testers want to draw general
conclusions about large groups of students, such as
an entire grade level for a district, a properly
selected sample can yield very accurate estimates
of "typical student" performance.
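
The idea can be sketched with simulated data. Everything numeric in the example below is an assumption made for illustration (a grade level of 3,000 students, scores centered on 50 with a spread of 10, a sample of 300); none of it comes from an actual district.

```python
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns (sample mean, standard error of the mean).
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / sample_size ** 0.5
    return mean, se


# Simulated scores for a hypothetical grade level of 3,000 students.
rng = random.Random(1)
scores = [rng.gauss(50, 10) for _ in range(3000)]

mean, se = estimate_mean(scores, sample_size=300)
print(f"district mean (all students): {statistics.mean(scores):.1f}")
print(f"sample estimate (300 students): {mean:.1f} +/- {2 * se:.1f}")
```

Testing one student in ten, the sample estimate should fall close to the figure obtained by testing everyone, with the standard error indicating how close.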

Another innovation in educational testing
situations, matrix sampling, takes advantage of just
such a random sampling procedure to increase
efficiency by reducing the number of students in-
volved in testing. But there is another sampling
dimension. Not only is it unnecessary for each stu-
dent to be tested to generate accurate group esti-
mates of academic performance, but it is un-
necessary for every student to respond to every
test item.

Matrix sampling involves the simultaneous 
random sampling of both students and test items. 
It involves, however, different, nonoverlapping 
samples of students taking nonoverlapping samples 
of items so that each matrix sample is a sample of 
students taking a sample of items. 

This requires a set of items to be partitioned 
randomly into several subsets and each subset given 
to a different sample of students. For example, if 
there are 50 items, they could be partitioned into
10 samples of 5 items each and each sample
randomly assigned to 1 of 10 samples of students.
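
The partitioning just described can be sketched as follows. The 50 items and 10 subsets come from the example above; the roster of 1,000 students, the 30 students per sample, and the function name `matrix_sample` are assumptions introduced only for illustration.

```python
import random


def matrix_sample(items, students, n_subsets, students_per_subset, seed=0):
    """Pair random, nonoverlapping item subsets with nonoverlapping student samples."""
    if len(items) % n_subsets:
        raise ValueError("items must divide evenly into the subsets")
    if n_subsets * students_per_subset > len(students):
        raise ValueError("not enough students for nonoverlapping samples")
    rng = random.Random(seed)
    shuffled_items = rng.sample(items, len(items))  # random order, no repeats
    chosen = rng.sample(students, n_subsets * students_per_subset)
    size = len(items) // n_subsets
    plan = []
    for k in range(n_subsets):
        item_subset = shuffled_items[k * size:(k + 1) * size]
        student_sample = chosen[k * students_per_subset:(k + 1) * students_per_subset]
        plan.append((item_subset, student_sample))
    return plan


items = list(range(1, 51))        # the 50 items of the example
students = list(range(1, 1001))   # hypothetical roster of 1,000 students
plan = matrix_sample(items, students, n_subsets=10, students_per_subset=30)
for item_subset, student_sample in plan[:2]:
    print(sorted(item_subset), len(student_sample))
```

Each student in the plan answers only 5 items, and 700 of the 1,000 students are not tested at all.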

The procedure reduces the number of stu-
dents and the amount of class time required to
generate the desired data. When responses to items
are summarized, the results may be generalized to
both the entire test from which the items were
derived and the entire population of students from
which the sample was selected. It is important to
note, however, that no information is gathered on
individual pupil performance. A matrix sample pro-
vides only group estimates.
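
How item responses are summarized into a group estimate can be sketched with simulated data. The example assumes 50 items split into 10 subsets, 30 students per subset, and a made-up chance of success on each item; it also treats all students as alike, which real pupils are not, so it is a simplified illustration and not an analysis procedure taken from the text.

```python
import random

rng = random.Random(0)
N_ITEMS, N_SUBSETS, PER_SUBSET = 50, 10, 30

# Simulated data: each item has a hypothetical chance of being answered correctly.
difficulty = [rng.uniform(0.3, 0.9) for _ in range(N_ITEMS)]

# Randomly partition the items into 10 nonoverlapping subsets of 5.
order = rng.sample(range(N_ITEMS), N_ITEMS)
size = N_ITEMS // N_SUBSETS
subsets = [order[k * size:(k + 1) * size] for k in range(N_SUBSETS)]

# Each subset goes to its own sample of students; keep only the
# proportion answering each item correctly, never an individual score.
p_correct = {}
for subset in subsets:
    for item in subset:
        answers = [rng.random() < difficulty[item] for _ in range(PER_SUBSET)]
        p_correct[item] = sum(answers) / PER_SUBSET

# Group estimate of the mean score on the full 50-item test.
estimated_mean = sum(p_correct.values())
expected_mean = sum(difficulty)  # known only because the data are simulated
print(f"estimated mean score: {estimated_mean:.1f} of {N_ITEMS}")
print(f"simulated true mean:  {expected_mean:.1f} of {N_ITEMS}")
```

The sum of the item proportions estimates the average score a student would earn on the whole test, yet no student answered more than 5 items and no individual result exists to report.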

Let me illustrate why matrix sampling might
be useful and appropriate for an annual district-
wide standardized testing program.

I will argue that the only truly legitimate con-
cerns of standardized testing in any district are
general conclusions about the entire student body.
Testing is one process of gathering information for
decision making. In education, decisions have to be
made at a number of levels.

First, we must make diagnostic and prescrip-
tive decisions regarding individual students. Second,
we must make decisions regarding the viability of
specific educational programs. (This is the newly
emerging concern for program evaluation.) In addi-
tion, building administrators must make general
school-level decisions. Finally, superintendents,
boards of education, and the public must make
district-level decisions.

In most districts, the information on educa- 
tional outcomes required for many of these multi- 
level decisions is typically generated from the 
annual administration of a standardized achieve-
ment battery. The computer scoring service is then
able to return individual pupil scores, class aver-
ages, building averages, and district summaries, all
for about 75¢ to $1 per student. This seems most
economical until one considers what actually 
happens to these test scores and summaries. 

First of all, at the classroom level, these scores
are designed to discriminate among students "to
help with diagnosis." However, I challenge anyone
to diagnose and prescribe from a grade equivalent
of 3.2 in a general gross construct called "total
reading." Most teachers recognize that such trans-
formed scores contain too little information to be
diagnostic or prescriptive.

The test publishers argue that teachers can do 
individual item analysis to reveal specific weak- 
nesses, but any teacher who has attempted to do 
this realizes how tedious this task can be.

At the specific program level of decision mak-
ing, standardized achievement batteries also fall
somewhat short of necessary data requirements.
The qualities of items selected to allow the test to
discriminate among students make it very difficult
for them to detect specific educational program
impacts. The items are simply too short, too gen-
eral, and too individualized to be sensitive to local
instructional interventions.

For example, correct responses to four addi-
tional test items represent a year's growth in grade
equivalent terms between grades five and six on the
Iowa Test of Basic Skills, Form 5, Level 12,
Arithmetic Concepts. Not only is it totally unfair 
to characterize an individual learner's year of 
growth so narrowly, but as a program developer, I 
would be quick to challenge an evaluator who 
selected such an imprecise tool to demonstrate the 
viability of my newly developed instructional 
sequence. From a program evaluation point of 
view, instruments more sensitive to local program 
objectives are much more desirable for program 
decision making than are any national standardized 
examinations. 

Many of the problems which arise from using
standardized tests as criteria for judging specific
program quality also arise when one attempts to
differentiate among general program elements,
such as classes, teachers, departments, or buildings.
Because of the lack of sensitivity of these tools,
there is little or no educational research delineating
any causal line between program elements and
standardized outcome measures.

To say that one school's learning environment 
is better than another's or one principal is more 
competent than another on the basis of stan-
dardized test data is a total misuse of the data. Yet
the summaries returned to districts by scoring ser-
vices and comments of educators would suggest
that this is the intent.

To date, educational research can establish no 
significant stable links between any teacher, admin-
istrator, or building characteristics and differential
standardized achievement test scores. Therefore, it 
is quite apparent that standardized test scores are 
incapable of contributing to specific and general 
program-related decisions. 

What, then, are these tests capable of doing? 
Very simply, they are useful as gross indicators
which can best serve as information for communi-
cation to the public on the state of achievement in
a given district. In fact, it may be that this is the 
only real use they are put to in most districts any- 
way. 

If these tests are really incapable of con-
tributing to important specific decisions, then their
use for public relations is their only appropriate
use. In that case, a testing system which yields only
the district average data would be sufficient. Such
a system can be created by using sampling pro-
cedures. An investment of hundreds of dollars for
sample data can provide information of the same
value as that previously gathered for thousands.
The dollar savings can be used to gather other
types of outcome data which are prescriptive and
appropriate for program-related decisions.






A SUMMARY OF ALTERNATIVES



NEA Resolution on Standardized Tests

76-65. The National Education Association
strongly encourages the elimination of group
standardized intelligence, aptitude, and
achievement tests.



In a final report to the NEA Representative
Assembly, the NEA Task Force on Testing recom-
mended that alternatives to group standardized
testing be developed. In keeping with that recom-
mendation, the following brief descriptions of
alternatives are presented.



Anecdotal Records

Recording the behaviors of individual stu-
dents reveals more about a student than do test
results. A teacher can develop a composite picture
of a student by observing and recording behavior
such as interaction with others, motivational
patterns, and independent work habits.

Teachers who have set about keeping
anecdotal records report that not only is the
experience satisfying for them, but that they im-
prove with practice and constantly get new insights
into a student that either support or explain other
evaluation results.



Oral Presentations by Students

Students' oral presentations have long been
accepted as one way to evaluate student progress.
For example, skilled reading teachers use them
both in evaluating student progress and in diagnos-
ing specific difficulties. Subject matter teachers can
assess both depth of knowledge and personal
capabilities using this mode of performance.

In assigning oral presentations to students,
teachers must state clearly beforehand what is
expected of the students and what are the criteria
for a good performance. The structure offers the
opportunity for self- and peer evaluation, partic-
ularly when oral presentations are recorded.



Contracts with Students

A contract or agreement between student and
instructor specifies tasks that both parties must
complete within a given period of time. An in-
structor carefully poses problems with varying
degrees of difficulty, and students, with teacher
guidance, select both the problems they will work
on and the amount of time they will spend on
them. Ability to perform the tasks is measured by
promptness and accuracy in completing the con-
tract items. Requiring students to work alone, as
contracts generally do, adds to an instructor's
understanding of a student's academic and personal
development.

Student Self- and Peer Evaluation

Students can be, and ought to be, involved in
evaluating their own work. The ability to assess
one's performance is useful in ways that transcend
school learning. Students can become more in-
sightful about themselves and their approach to
work.

Self- and peer assessment is complex, partic-
ularly when more than simple student products are
being judged. When expressing subjective judg-
ments, students tend to underrate rather than over-
rate their own abilities and achievements. Because
of this, some background and skill on the part of
teachers is required in dealing with sensitive affec-
tive areas and development of self-concepts. Many
well-developed approaches exist for such purposes.

Parent-Teacher Conferences 

The purpose of most parent-teacher con-
ferences is for the parties to exchange infor-
mation about a student that will help to guide him
or her into productive channels and to find ways in
which he or she can secure satisfaction and growth.
These conferences are valuable when there is
thoughtful preparation and when they are used as a
supplement to written evaluations. They also offer
an excellent opportunity to relate schooling to the
home, a necessary adjunct for a vast majority of
students.






Objectives-Referenced (Criterion-Referenced) Tests

The potential advantage of these tests over
standardized tests is that students are judged on
their mastery of objectives rather than on their
standing in relation to others. In this way, they
serve some diagnostic purposes. At this time, these
tests have not had wide enough use to confirm
fully their value, particularly when broad subject
areas must be considered. If the original intention
of criterion-referenced tests is not distorted, they
have great potential as an alternative to stan-
dardized tests.



Individual Diagnostic Tests

These sophisticated evaluation devices are
reliable and valid for specific purposes. They are
underused because administering them is fre-
quently time-consuming and expensive. Also, apply-
ing, scoring, and analyzing individual diagnostic
tests often requires special training that is generally
not available to teachers.

Nevertheless, these tests have the potential of
providing additional information the teacher can
use in prescribing learning strategies for a student.



Teacher-Made Tests

Good examples of objectives-referenced tests
are those constructed by teachers for their own
use. These can closely reflect the content and
emphasis of classroom subject matter, and teachers
can use the results in making decisions that are as
diverse as the pace of instruction; prescriptive
assignments; reporting to or conferring with par-
ents; and promoting or retaining a student. This
broad range of decisions requires that teachers be
familiar with methods of constructing classroom
tests which will measure both factual knowledge
and higher levels of thinking.



School Letter Grades

Conventional grading systems of A, B, C, D,
or E or equivalent designations (percentages/
averages) are, for the most part, understandable to
parents and acceptable to students. Giving grades
can be particularly valuable when teachers use
descriptions to expand and to clarify the meaning
of the grades. Even such limited descriptors as
excellent, satisfactory, and needs improvement
may be more useful to parents and students than
the standardized test statistic of "10 standard score
points above the mean."



Open Admissions 

In a strict sense, the policy of open admis- 
sions is not an alternative to testing but a practice 
that indicates change in college requirements. It 
could eliminate the need for standardized tests.

Many institutions of higher education now 
accept all students who have completed high school;
they give no consideration to scholastic aptitude or
achievement tests. Some universities admit fourth
year high school students into their freshman
classes upon the recommendation of high school
teachers. Adult education and life-long learning
programs provide access to degree-granting pro- 
grams for working people. Combinations and varia- 
tions of the above offer opportunities which in the 
past would have been highly unusual if not totally 
unacceptable. 

The criterion for admission to these schools is
the desire to be educated rather than the score 
achieved on a standardized test. 

The full story is not yet in on the success of
these programs, but the concept of openness
provides for more equitable educational oppor- 
tunities for all. This holds true for other levels of 
schooling. 



TESTS AND USE OF TESTS: 

NEA CONFERENCE ON CIVIL AND HUMAN RIGHTS IN EDUCATION, 1972 






SOCIOCULTURAL FACTORS IN THE EDUCATION OF BLACK AND CHICANO CHILDREN 

by Jane R. Mercer



Studies dating back to the 1930's have
demonstrated the cultural biases inherent in
IQ tests and other standardized achievement measures.
Yet clinicians have continued to interpret chil-
dren's performances on these tests as if there were
no biases and have never systematically taken
sociocultural differences into account when
interpreting the meaning of a particular child's
score. Consequently, we find many children in
classes for the mentally retarded whose adaptive
behavior in nonacademic settings clearly demon-
strates that their problems are school specific and
that they are not comprehensively incompetent.

Disproportionately large numbers of Black,
Chicano, and probably Puerto Rican children are
labeled mentally retarded by the public schools. In
California, the rates for placing Chicano and Black
children in classes for the mentally retarded are
two to four times higher per thousand than for
English-speaking Caucasian children, or Anglos.

The public schools label as retarded a large
number of children who are not so regarded by
their families, neighborhoods, churches, or other
community organizations. We asked 241 organiza-
tions in a Southern California city for information
on each retarded person they were serving. The
public schools listed far more retardates than any
other formal organization, shared their labels with
more other organizations, and labeled as retarded
more persons with IQ's above 70 and with no
physical disabilities. There were 4½ times as many
Chicanos and twice as many Blacks in public
school classes for the mentally retarded as would be
expected from their proportion in the population,
and only half as many Anglos as would be ex-
pected. The Black and Chicano children in these
classes had higher IQ's and fewer physical dis-
abilities than the Anglo children. While we found
no evidence that these ethnic disproportions re-
sulted from a conscious policy of discrimination,
the labeling process is clearly Anglocentric.

We then sought to identify the aspects of the
clinical assessment process that produce ethnic dis-
proportions, by testing a representative sample of
6,907 persons in the community. We used the
American Association for Mental Deficiency's
definition of a mental retardate as a person
subaverage in both general intellectual functioning
and adaptive behavior and developed a series of 28
age-graded scales to measure adaptive behavior. We
also used standardized measures of intelligence,
mainly the Stanford-Binet LM and the Kuhlmann-
Binet.

We found that the educational institutions'
definition of mental retardates as those with IQ's
of 79 or below (the lowest 9 percent of the
population) is one factor producing ethnic
disproportions in the labeling process. We concluded
that a 3 percent cutoff (IQ below 70) is most likely to
identify persons in need of special assistance and
least likely to stigmatize those who perform a
normal complement of social roles.
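
The link between an IQ cutoff and the share of the population falling below it can be illustrated with a normal curve. The sketch below is only an approximation under stated assumptions: the text does not say which score scale was used, so a mean of 100 and standard deviations of 15 and 16 are assumed here, and the helper name is hypothetical.

```python
from statistics import NormalDist


def share_below(cutoff: float, sd: float, mean: float = 100.0) -> float:
    """Fraction of a normally distributed population scoring below `cutoff`."""
    return NormalDist(mu=mean, sigma=sd).cdf(cutoff)


# Assumed scales: mean 100 with a standard deviation of 15 or 16.
for sd in (15, 16):
    below_80 = share_below(80, sd)  # "79 or below" treated as "below 80"
    below_70 = share_below(70, sd)
    print(f"SD {sd}: below 80 = {below_80:.1%}, below 70 = {below_70:.1%}")
```

Under these assumptions the two cutoffs land near the 9 percent and 3 percent figures cited above, and raising the cutoff by ten points roughly triples or quadruples the number of people labeled.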

Most psychologists give only an IQ test when 
making assessments of mental retardation. How- 
ever, we found that 60 percent of the Chicanos and 
91 percent of the Blacks in our sample who had IQ 
test scores below 70 passed the adaptive behavior
measure, while none of the Anglos with IQ's this 
low were performing normally in their social roles. 
The IQ test is obviously not a valid predictor of 
social role performance for Chicanos and Blacks, 
although it seems to do a good job for Anglos.
Schools should adhere to the AAMD definition of 
mental retardation and develop a systematic 
method for measuring adaptive behavior as well as 
IQ in making psychological assessments. A child 
should have to fail both criteria before being 
labeled mentally retarded. When we followed this
procedure, ethnic disproportions were reduced but 
still not completely eliminated. 
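
The two-criterion rule described above can be written as a short decision function. This is a sketch under stated assumptions, not the authors' procedure: the function name is hypothetical, the cutoff of 70 follows the 3 percent cutoff discussed earlier, and the adaptive behavior result is reduced to a simple pass or fail.

```python
def meets_both_criteria(iq: float, passed_adaptive_behavior: bool,
                        iq_cutoff: float = 70.0) -> bool:
    """Return True only when a child fails BOTH criteria.

    Failing the IQ criterion means scoring below `iq_cutoff`; failing the
    adaptive behavior criterion means not passing that measure.
    """
    failed_iq = iq < iq_cutoff
    failed_adaptive = not passed_adaptive_behavior
    return failed_iq and failed_adaptive


# A low IQ score alone is not enough to label a child under this rule.
print(meets_both_criteria(iq=65, passed_adaptive_behavior=True))   # False
print(meets_both_criteria(iq=65, passed_adaptive_behavior=False))  # True
```

An IQ-only rule corresponds to dropping the second condition, which is what labels children who perform normally in their social roles.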

The IQ tests now being used by psychologists
are, to a large extent, Anglocentric. We found that
about 32 percent of the differences in IQ test
scores in a sample of 1,500 Black, Chicano, and
Anglo elementary school children in California 
could be accounted for by differences in the socio- 
cultural characteristics of their families. Un- 
fortunately, most psychologists treat a score as a 
score. When social background was held constant, 
there was no difference between the measured 
intelligence of Mexican-American and Black
children and the Anglo children on whom the test was
standardized. We concluded that diagnostic
procedures in the public schools must be broadened to
reflect the pluralistic nature of American society 
and must involve securing information beyond that 
ordinarily used in public school assessment. 

A major concern of parents we interviewed in
our studies was the stigmatization of their children. 
Their children were ashamed to be seen entering 
the mentally retarded room and dreaded receiving
mail that might bear compromising identification.
The parents were also concerned about the quality
of the educational program in the self-contained
special education class. Their children were not
taught to read as they would be taught in the
regular classes, and many saw the program as a
"sentence of death." We followed a group of 108
children in special classes for several years; only
one in five ever returned to regular classes. The
others aged out, dropped out, or were sent to other
special programs.

It would be a tragedy, however, if special
education programs were jeopardized because of
inadequacies in assessment procedures and
programming. I believe that there are viable alternatives
to present practices without resorting to dumping
special education students and cutting special
education funds. School psychologists should be
required to enlarge the scope of information they use
in making educational decisions by regularly and
systematically studying students' adaptive behavior
in nonschool situations. If a child performs
adequately in these settings, his or her problems
are school specific and will need special tutoring,
programed learning, cross-age teaching, and
remedial reading, rather than a self-contained
classroom. School psychologists should also secure
information about a child's sociocultural background
to use in interpreting his or her IQ test score and
developing pluralistic norms. The child's
performance should be compared not only with the
performance of the general population, which is
composed primarily of Anglo children, but also
with the performance of other children from his or
her own sociocultural background.

I do not agree with those who say we must
stop all educational labeling or IQ testing. Our
problem has been that our labels are too few and
too crude. We need a more sensitive system for
identifying children in need of special education
and a continuum of special education programs
carefully targeted for children with specific needs.
One of the most distressing developments in some
regions has been the precipitous reassignment of
many children to the regular classroom from
self-contained classrooms with no continuation of
special services. These children must continue to
receive special education, and that money must
continue to be provided. Financial support and the
effort of special education teachers should be
redirected toward providing a wider variety of special
services to keep children in the educational
mainstream and to educate each to his or her maximum
potential.







BIAS IN TESTING
by William F. Brazziel



Thousands of minority children are denied
equal access to quality education each year because
of flaws in the testing apparatus of our schools.
This situation, which is both illegal and immoral, is
becoming less and less tenable. At least 20 class
action suits around the country are seeking to
force school districts to cease and desist in the
inaccurate testing of minority children. In the
biggest suit, in California, the NAACP and several
civic groups are seeking a dissolution of classes for
the retarded and a moratorium on testing until
better, more precise instruments are devised.

The testing problems of minority children
begin on the first day at school, when they have to
take imprecise tests and have their scores recorded
on a cumulative record. Should a child's score be
less than 100, the teachers will not work diligently
with the child, giving the child less than his or her
share of attention and assistance. The child will get
more than the ordinary share of slights and
indignities as he or she moves through the schools
and will be denied access to a college preparatory
curriculum.

More and more people are losing confidence
in the schools, and the spectacle of a testing
apparatus in disarray will do little to restore this
confidence. The $300 million testing industry must
come up with better instruments. The schools must
eliminate injurious instruments, and
psychometrists and teachers must be retrained to ensure
that tests become a part of the solution rather than
the problem in American schools. Resistance to
this reform movement will come from long-time
test consultants, conservative school people,
racists, teachers of teachers, and the psychometric
profession.

Henry Dyer, vice-president of the Educational
Testing Service and the dean of American
psychometrists, says that IQ tests are the most useless
source of educational controversy ever invented
and that schools which have not yet dropped them
should do so forthwith. He notes that a more
sophisticated testing apparatus could be developed
for schools with a heterogeneous population, but
that the continuation of IQ tests would preclude
this. Dyer is right. The very best test in America
was standardized on 4,400 children, most from
California suburbs. Neither minority children nor
children from the Southeast were included in the
sample.

Dyer suggests more school programs based on
the philosophy of Jean Piaget. In the Piagetian
school, teachers, tests, and curriculum are viewed
as resources to maximize each child's development.
Instead of being slapped in the face with a biased
IQ test, a child takes a battery of sophisticated
tests designed to ascertain his or her language
development, comprehension, symbol
manipulation, discriminant analysis, and other skills. None
of the child's scores are recorded in a cumulative
folder. Criterion-referenced tests are used instead
of the norm-based achievement tests, so child
racing is eliminated. In the Piagetian school, parent
conferences and skill sheets replace report cards,
and continuous progress learning replaces
promotion from grade to grade.

Piagetian teachers sometimes give
environmental or culture-specific tests, in which familiar
concepts from the child's neighborhood are used to
ascertain his or her ability to think. A Mississippi
child, for example, might be asked to match
singletree, lespedeza, sweetmilk, tedder, dasher,
hamestrap, blue tick, and walker, instead of sonata and
bas-relief. The content of any test is irrelevant in
the Piagetian school.

Culture-specific tests are not new, but the
testing corporations have found it unproductive to
develop tests for each of the 40 or so cultural
groups in this country. The government and the
corporations should have such tests available in a
few years, but in the meantime school systems and
individual teachers will have to make their own.
There is no great mystery about test making; over
10,000 tests are on the market. Like the struggle to
get publishers to market integrated textbooks, this
movement will probably have to resort to teacher
and school system efforts to prime the pump.

The situation concerning criterion-referenced
tests is better. These achievement tests, which
measure only what has been taught in a particular
module, by a particular teacher, and in a particular
time span, come with such model programs as
Distar and individually prescribed instruction.
Their value lies in their elimination of the need to
have losers in the testing game who make it
possible for others to succeed. They focus on
growth and behaviorally oriented goals and so
benefit both low-income and middle class children.

Nothing that is wrong with the testing
programs in American schools is immune to hard
work, imagination, and nerve. The denial of equal
access to quality education is criminal, and the
public will not tolerate it. If a few in our midst are
resistant to change, we shall find ourselves caught
up in an embarrassing maze of court suits, reduced
budgets, performance contracts, school vouchers,
and steadily eroding public confidence. Our job is
to make sure that this does not happen and that
American schools are modified to serve well all the
children of all the people.






USE OF TESTS: EDUCATIONAL ADMINISTRATION

by Jose A. Cardenas 

The purpose of evaluation and testing is decision
making. Tests (of intelligence, achievement,
and personality) provide information for making
decisions about continuation, promotion,
graduation, special assignment and placement, diagnosis
and prescription, student feedback, motivation,
and evaluation. I will focus on the intelligence test.

The most serious problem in the assessment
of intelligence and the use of intelligence tests is
the assumptions that are made. One assumption is
that intelligence is intangible and not directly
measurable. Another is that intelligence can be
indirectly measured by assessing some form of
behavior or performance that is assumed to be
solely dependent on intelligence and not related to
other variables such as related understandings,
prior learnings, and motivation. A third assumption
is that the evaluation of behavior and performance
requires written, oral, or nonverbal interaction
between tester and testee. A fourth is that test results
give little or no information unless a comparison is
made between the performances of the testee and
the norm group. We also assume that the
characteristics of the two are compatible.

Research indicates, however, that more is
unknown than is known about intelligence and that
the assumptions and methods of testing are not
always valid. Performance can be based on more
than the one variable of intelligence. Our
assumption that all other factors are equal or are
nondependent variables that have nothing to do with
intelligence is false. Language facility, reading
ability, and cultural compatibility all influence test
scores.

Too many intelligence tests assume that all
children have had common experiences, for
example, that they are all familiar with snow. We
haven't had a snow holiday for the last five years in
the Edgewood school district, and most of the kids
have never seen snow, yet some group intelligence
tests ask about the use of a sled. One question on
the WISC asks, "If your mother sends you to the
store for a loaf of bread and there is none, what do
you do?" The child who answers, "I go back
home," is considered to be intellectually inferior to
the child who says, "I go to another store."
However, in rural areas, there is no place else to go. In
some families, the parent and not the child is
supposed to make such a decision. During my
youth, I was sent for tortillas, and the purchase of
a loaf of bread was unheard of in my house until I
was 15 years old.

Testing methods can also be incompatible
with a culture. For example, some tests emphasize
competitiveness, but Mexican American children
perform better in a cooperative situation.

Our assumption that an individual has the
ability to verbalize can harm children with physical
disabilities. A child who stammers will probably
score low on a verbal or language test even though
this disability is unrelated to the child's intelligence.

Motivation is assumed not to be a factor
influencing a child's performance on intelligence
tests, yet a test may be highly motivating for some
children and totally inhibiting for others. We also
assume that the testee-tester relationship is not a
dependent variable, yet many studies indicate that
Black and Mexican American pupils perform better
on intelligence tests when they are administered by
Black and Mexican American administrators. Score
reliability is, likewise, assumed, but how smart you
are is more dependent upon who scores your test
than upon your intelligence or your performance.

The second major problem in testing is the
dysfunctional responses we sometimes use to try to
remedy our invalid assumptions. It is simplistic to
give a Puerto Rican child a Spanish translation of
the WISC or the Peabody Picture Vocabulary
Inventory. Some English stimulus words become
Spanish paragraphs when translated properly. I
have never heard a one-word translation of "cream
puff," yet this word is on the Spanish version.
Also, the Spanish equivalent of an English word
may be on an entirely different level of difficulty.
No matter how good the Spanish translation of a
test, it must also take into account the regional
variation of the language. When I once told a test
translator that her translation must be regionalized
to be valid, she protested the expense and said, "If
the kids don't know this type of Spanish, that's
their problem." So Spanish-language tests can be
just as invalid as English-language tests for
Spanish-speaking children.






Bilingual children who take Spanish-language
tests receive no credit for their knowledge of
English and, in fact, are penalized. The failure of a
bilingual child to respond to the stimulus word
mariposa supposedly indicates lack of intelligence,
even though the child may understand the word and
concept of butterfly. When bilingual children are
administered a Spanish-language Peabody after
having taken the English version, they select the
same responses they made in English, even though
they now see that some of them are wrong. Even
using bilingual tests does not solve the problem;
scores have been shown to vary according to
whether the child's dominant language appears first
or second on the page.

We must protect all children against invalid 
testing. We must reeducate educational personnel 
and perhaps discontinue intelligence testing, at 
least until the reeducation is complete. We must 
develop new and functional techniques for
measuring intelligence and establish different criteria for
making decisions. Above all, measures should be
taken by the National Education Association and
organizations of school administrators and 
counselors to protect children from the invasion of 
privacy through testing. 






USE OF TESTS: EMPLOYMENT AND COUNSELING

by Thelma Spencer



Tests are widely used in counseling, but their 
validity depends on the interpretation of scores
and on the situational appropriateness of the tests.
The interpretation of scores often indicates the
poor quality of the counseling to which students
are exposed.


. . . [A] young girl had received honors in junior 
high school and top scores in every standardized test 
she took. As she was about to graduate she was given 
an interest inventory. According to her responses, the 
girl's major interest was in clerical work.

The school guidance counselor met with the girl 
and informed her she should become a secretary. Not 
only did the counselor tell her she would be happier
in a commercial course in high school, he insisted that
she would be unable to cope with the intellectual
demands of a liberal arts college. In his ignorance the
guidance counselor had confused a questionnaire that
supposedly shows what people like to do with
aptitude tests that attempt to measure what people
can do.1

Bad counseling can take other forms, as in
this incident related by a New York high school
student:

. . . [T]he people in my section were never told
about the PSATs. The kids in the Honor sections
took those PSATs as a matter of course. In fact, the
teachers strongly urged that they take those PSATs in
order to get some practice for the next year's SATs.
And I didn't take a PSAT until I got into my senior
year, and then I found out I had to take an SAT.

That sticks in my craw. We were all students and
we supposedly were all aspiring to college.2

In both of these cases the system worked 
against the student, but the first student is repre- 
sentative of the type for whom the system is 
supposed to work. The second is representative of 
those for whom the system works very seldom. 

Tests also militate against students. Some
children are tested more than they are taught, and
curriculum is often based on what can be measured
by existing instruments. Too many counselors
equate a test score with the intangibles of
personality (motivation, desire, and ambition) and
then ignore personality in favor of a preconceived
notion about the individual being counseled. "The
best of our tests," as Oscar Buros says, "are still
highly fallible instruments which are extremely
difficult to interpret with assurance in individual
cases." Many counselors and teachers victimize
youngsters by equating the individual with the
norm and the testee group with the norm group.

Some people say that the evolution of a stable 
examination system has helped create our much 
vaunted stable education system, but while the 
SAT may permit admissions officers to select stu- 
dents most similar to the norm group, those stu- 
dents come from schools whose programs are based 
on what the colleges offer. The implications of this 
vicious circle are staggering: too many poor and 
minority students who fail to meet admissions 
criteria are guided into general, vocational, and
commercial classes.

Sometimes a student succeeds despite a coun- 
selor's or teacher's doubts about his or her ability 
to do so. Many students never even see a counselor, 
or if they do, the counselor is a disciplinarian and 
attendance taker, not a helper and adviser. For the
majority of students, adequate counseling and
guidance are myths.

Using test scores to determine employability
is another gross misuse of tests. For example, the
National Teacher Examinations have been used by
some to determine who should be retained when a
southern school district is under court order to
dissolve its dual system. When the school system in
Columbus, Mississippi, required a cutoff score of
1,000 on the NTE for retention in the unified
district, eight Black teachers were not rehired. In
Starkville, Mississippi, the school district used the
Graduate Record Examination to determine
teacher retention, although it granted provisional status
to teachers with NTE scores of 500 on both the
common and teaching field sections. These districts
equated a test score with competence in the
classroom. In April 1971 Judge Orma Smith ruled that
personnel selection based on NTE and GRE results
was discriminatory, as more whites than Blacks
scored above the cutoff point in both districts. In
Griggs et al. v. Duke Power Company the U.S.
Supreme Court ruled that tests for employability
violate the 1964 Civil Rights Act "when the rate of
rejection is higher for Negroes than for whites and
there is no showing that the passing of such tests
is significantly related to the successful
performance of the job."

Test scores are guides only, and the NTE
score is merely another piece (by no means the
most important piece) of information about a
person. This test, or any test, is only as good as the 
people who use it. 




MISUSE OF TESTS: SELF-CONCEPT 

by Robert L. Williams



The problem of testing Black children is very 
serious. Biased tests are not just a violation of civil 
rights, but are a form of Black intellectual 
genocide. The whole American educational system 
is unfair, and the argument against tests is used as 
one instrument to open the door to change the 
whole system. 

What do we mean by intelligence? Is it what 
the intelligence tests measure? Is it a global 
capacity to deal with one's environment? I offer
the rubber band theory to illustrate my definition
of intelligence. A rubber band will stay in its
relaxed state unless stretched to its capacity by an
outside force. Genetics or heredity determines an 
individual's potential stretch, and the environment 
determines the extent to which he or she reaches 
this potential. 

Test items drawn from white culture penalize
Black children, whereas items drawn from Black
culture penalize whites. I have developed the
BITCH test (Black Intelligence Test of Cultural
Homogeneity) with items drawn directly from
Black culture. A child who knows Malcolm X's
birthday or the date of his assassination shows as
much intelligence as the one who knows
Washington's birthday. I've never seen the word pick
illustrated on the WISC, only comb, which is
something I can't use.

The three criteria for a test (validity,
reliability, and standardization) exclude Black
people. A test is valid when it measures what it
intends to measure. Currently used ability tests do
not measure Black intelligence. If a test asks, for
example, "What should you do if you find a purse 
with five dollars in it?" Black children will say, 
"Keep it"— a culturally determined response. They 
will say what their environment has dictated that 
they say, but on the standardized test, they will be 
marked zero. 

Reliability means test consistency, i.e., a test 
will yield the same score or rank an individual in 
the same place each time. Since the standardized 
tests are scored subjectively, and since they
validate only mainstream cultural responses, they 
cannot be reliable. 



Standardization refers to the extent to which 
the sample on which the test is based represents 
the people who will take it. Several of the major 
ability tests excluded Blacks, Mexican Americans, 
and Puerto Ricans from their standardization
samples. The Stanford-Binet, WISC, and Peabody
systematically excluded Blacks from their samples.
If a test is not standardized on a particular group,
it probably does not represent that particular
group and should not be used on its members.
Standardization is one reason for the 15-point
difference in IQ between Black and white kids. The
discrepancy means simply that the test is biased.

Arthur Jensen's research has repeatedly 
shown that tests are biased— that Black and 
Chicano kids who have IQ's in the 60-to-80 range 
score much higher on learning tests than do white
kids with the same IQ. Jensen's interpretation is
that Blacks show more associative or Level I
learning than whites; if you ask Black and white kids to
recite six digits backwards and forwards, Black kids
do better than whites with the same IQ. An
alternative interpretation is that the biased IQ
underestimates the ability of Black children and
indicates that they are clearly superior to the white
kids who scored within the same range on the
test, because the test was standardized on the
whites.

Some people argue, "The tests do exactly 
what they are supposed to do. They predict 
scholastic success." Ability tests (X) predict a 
criterion (Y), such as a child's performance in the 
classroom. The hidden fallacy is variable Z, which 
might be unfairness, motivation, anxiety, or anything
that influences X (test scores). A fair test and a fair
criterion will produce a high correlation between X
and Y: white people who do well on tests do well
in school. The unfair WISC and a fair classroom
will produce a low correlation between X and Y: a
Black child who does poorly on the test will do
well in the classroom. Another combination is a
fair predictor, such as the Davis-Eells Games test,
and an unfair criterion, the culturally biased
classroom. This combination will also yield a low
correlation: Black kids will do well on the test and
poorly in the classroom. With an unfair predictor
and an unfair criterion (the classic situation for the
Black child) the correlation is high: the Black child
who does poorly on the test also does poorly in the
classroom.
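
The four predictor and criterion combinations described above can be laid out as a small lookup table. The sketch below only restates the author's argument in code; the names are hypothetical and nothing in it is computed from data.

```python
# Keys are (predictor_is_fair, criterion_is_fair); values are the level of
# correlation between test scores (X) and classroom performance (Y) that the
# argument above expects for that combination.
EXPECTED_CORRELATION = {
    (True, True): "high",    # fair test, fair classroom
    (False, True): "low",    # unfair test, fair classroom
    (True, False): "low",    # fair test, culturally biased classroom
    (False, False): "high",  # unfair test, culturally biased classroom
}


def expected_correlation(predictor_is_fair: bool, criterion_is_fair: bool) -> str:
    """Look up the correlation level claimed for one combination."""
    return EXPECTED_CORRELATION[(predictor_is_fair, criterion_is_fair)]


for (fair_test, fair_class), level in EXPECTED_CORRELATION.items():
    print(f"fair test={fair_test}, fair classroom={fair_class}: {level}")
```

Read this way, a high correlation cannot by itself show that a test is fair, because the fair-fair and unfair-unfair rows give the same result.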

After I administered the WISC test to about
500 Black kids, I then gave the BITCH test. Of the
420 children in the low WISC group, 75 to 80
percent scored high on the BITCH. I still have to
examine other criteria to see how well the BITCH
scores correlate with scholastic performance, but at
least I know that most of the Black children who
scored low on the WISC are neither educationally
mentally retarded nor in the borderline defective
range.

At least four court suits are now pending on 
the use of standardized tests in San Francisco, San 
Diego, Boston, and St. Louis because they violate a
child's constitutional rights under the Fourteenth
Amendment.

The Boehm's test of 50 basic concepts is 
clearly written for white folks. It asks the child to 
select the picture that shows "behind the couch"
or "under the table." A Black child does not say
"behind" but "in back of," not "under" but "up
under." We are now rewriting the instructions to
that test to see if children understand the concepts
in Black English. Black and white children can have
the same cognition but communicate it differently.
To the cognition "few," a Black child might say,
"Well, that's not a whole bunch of them." Only
the vocabulary is different, and a difference is not
a deficiency. You don't evaluate Black people in
terms of how white they are, but this is what the
tests have done. They do not measure Black 
ability. 

Eliminating and inhibiting intelligence early in 
life is the best way to keep Blacks out of the
system. I would opt for talking to children in the
dialect they understand. If a child can understand
what you are asking of him or her on a test, that
child will probably master the task. You cannot
expect an individual who has not been exposed to
German or French to understand these languages. 
This does not mean that he or she lacks the 
capacity to learn German or French, but that the 
child lacks that particular exposure. 

Black parents should be concerned about 
both the predictive variable, the test, and the
criterion variable, the classroom. I think whites
should also be vitally concerned. Brother Charlie
Mingus said, "When they came and took the
Catholics, I did not complain, because I am not a
Catholic. When they came and took the Unionists,
I did not complain, because I am not a Unionist.
When they came and took the Panthers, I did not
complain, because I am not a Panther, but then
one day they came and took me." I don't think we
should let another generation pass in this country
that knows all about extra-vehicular space activity,
atomic physics, and all of these highfalutin things,
but does not know what a human being is.






THE S.H.A.F.T.* TEST



A highlight of the conference was the partici- 
pation of more than 40 high school students, who 
reacted to the speakers, joined in the discussion 
groups, and wrote and administered the S.H.A.F.T. 
test. 

The students from Callanan Junior High 
School in Des Moines, Iowa, told conference 
participants to keep their tests face down until told 
to start. They would be given 10 minutes to answer 
22 multiple choice questions about student culture 
that were standardized on ninth grade students
from Des Moines. Student proctors enforced the
no-talking, no-peeking rules, although from time to
time the educators asked for erasers, wanted to
know the time, and called out, "Hey, Teach!
Where's the pencil sharpener?"

The test scores formed a perfect bell-shaped
curve: a handful of the 650 conferees scored high,
most were average, and a few below it. Besides the
public exposure (conferees had to wear black, red,
or yellow armbands depending on their scores, or
go bare armed for scoring too low), participants
were told that their scores would be made part of
their cumulative folders to haunt them for life.

(Student's Hype Arranged for Teachers) 

1. What is the slang word used to describe a
   blemish?
    a. Zilch
    b. Arg
    c. Zit

2. The author of the book Right On is
    a. Julian Bond
    b. Jerry Rubin
    c. Imamu Baraka (LeRoi Jones)

3. What are waffle stompers?
    a. Pancake chef
    b. Snowshoes
    c. Ice cream sandwiches

4. Steal This Book was written by
    a. Allen Ginsberg
    b. Eldridge Cleaver
    c. Abbie Hoffman

5. Who wrote the song "Purple Haze"?
    a. Jimi Hendrix
    b. Aretha Franklin
    c. James Brown

6. For what purpose would you use a roach clip?
    a. To keep ladies' blouses closed
    b. To hold the end of a reefer
    c. To get rid of bugs

7. Who wrote Love Story?
    a. Ryan O'Neal
    b. Henry Mancini
    c. Erich Segal

8. The author of Alice's Restaurant is
    a. Bob Dylan
    b. Alice Cooper
    c. Arlo Guthrie

9. What can you get at "Alice's Restaurant"?
    a. Soul food
    b. Storybooks for children
    c. Everything you want

10. "Ripple" is
    a. Rumor in a faculty
    b. Cheap wine
    c. A game of chance

11. The term rip off means
    a. To tear
    b. To steal
    c. To cop out

12. What rock group sang the anti-drug song, "The
    Pusher"?
    a. The Who
    b. Steppenwolf
    c. Blood Rock

13. "Tommy" is
    a. A British cop
    b. A fast sports car
    c. A rock opera

*S.H.A.F.T. stands for "Student's Hype Arranged for
Teachers."







14. "Make tracks" means
    a. To inject dope
    b. To burn rubber
    c. To split

15. To "crash" is to
    a. Have an accident
    b. Come down from the use of drugs
    c. Lose all your money

16. Who were the originators of Jesus Christ
    Superstar?
    a. Rado & Ragni
    b. Rice & Webber
    c. Brewer & Shipley

17. Lenny is a play on the life of
    a. Lenny Bruce
    b. Lenny Bernstein
    c. Lenny Brezhnev

18. "Getting off" means
    a. To feel the effect from the drugs you
       have taken
    b. A vacation from work
    c. To cease the addiction of heroin

19. A "hit" is
    a. A robbery
    b. An internal dose of drugs
    c. A very popular teenager

20. Hash is
    a. Cheap opium
    b. A mixture of pep pills
    c. A resin from marijuana

21. A "hemmie" is
    a. A souped-up engine
    b. An Ernest Hemingway short story
    c. A shirt that has been shortened

22. An ounce of marijuana is referred to as a
    a. Reefer
    b. Key
    c. Lid

— Written and administered by students 
from Callanan Junior High School, 
Des Moines, Iowa 



Key to S.H.A.F.T. Test

 1. c     9. c    16.
 2. b    10. b    17.
 3. b    11. b    18.
 4. c    12. b    19.
 5. a    13. c    20.
 6. b    14. a    21.
 7. c    15. b    22.
 8. c









How To Score 

 0-4 correct: Nothing
 5-10 correct: Yellow armband
11-16 correct: Red armband
17-22 correct: Black armband
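
The scoring bands above amount to a simple lookup from number of correct answers to an award. A minimal Python sketch of that lookup follows; the function name `shaft_award` and the range check are illustrative assumptions, not part of the students' original test.

```python
def shaft_award(num_correct: int) -> str:
    """Map a S.H.A.F.T. score (0-22 correct) to the award in "How To Score".

    The function name and the error handling are hypothetical; only the
    four score bands come from the test itself.
    """
    if not 0 <= num_correct <= 22:
        raise ValueError("the test has 22 items, so scores run from 0 to 22")
    if num_correct <= 4:
        return "Nothing"
    if num_correct <= 10:
        return "Yellow armband"
    if num_correct <= 16:
        return "Red armband"
    return "Black armband"


if __name__ == "__main__":
    # Print the award at each band boundary.
    for score in (0, 4, 5, 10, 11, 16, 17, 22):
        print(score, shaft_award(score))
```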







wrap-up 

by Dwight Allen



Testing is a part of the mindless process of 
education, that really is suited to the simpler 
society of the past. Education is out of date 
sociologically, psychologically, and physiologically, 
but I would rather work within the system than 
burn it down. Either way, traumatic change will
hurt people: by and large, the wrong people. If
you were to close down all the educational systems
in this country, the people in the upper middle
class would pretty much do it on their own, but
the people who most need to gain access to educa-
tion would be done in.

This doesn't mean that you shouldn't do in a 
particular test, because some need to be done in. 
For example, if you are like most audiences, only
about 20 percent of you can name the capitals of
North and South Dakota and North and South
Carolina, except for the fifth grade teachers. The
real standard for an educated person is not being
able to name states and their capitals but being
able to use an atlas to find them when one needs
to. When I suggested to a fifth grade teacher that
she let her kids use an atlas on the next test, she
said, "Oh, no! You couldn't do that. They'd all get
it right."

Our whole testing program is oriented toward
a normative assumption: an upper and a lower
half. I won't buy into any system that requires
someone to fall off the bottom, because this isn't
my view of what education is about. An educa-
tional system is needed that will assure a win/win
proposition for children instead of the present
zero/win game.

Very few of us in education ever know when
we're done, despite all the tests we give. If a stu-
dent seems about finished, we enrich him or her,
and in the name of high standards we go about it
all backwards. Kids soon learn that rewards are
given not for achievement, but for keeping the seat
warm, being good, and working hard. If a kid
comes in with a theme, throws it on the teacher's
desk, and says, "Here's something I whipped off in
20 minutes. I hope you like it," the teacher says,
"You should have worked harder." If the same kid
brings in the same paper and says, "Here's the
eighth draft. I wish I had time for a ninth," she
says, "Nice boy, Johnny. You are working hard."

A revitalization of the testing program has to
be thought of in terms of the broader objectives of
education. The problem with this, however, is that
anytime people want to avoid having to do some-
thing, they say, "Let's stop and get our objectives
organized." This is good for at least two years. My
objectives are the same as yours. I want all the kids
to be constructive members of society, to be self-
realized and have lots of skills, and to be happy,
healthy, and democratic. The objective of our
school system, however, is status. With so many
people being anointed first with high school
diplomas and now with college degrees, it's getting
hard to make that status system stick. Education
should reflect a status system that is based on
legitimate differences in ability.

Here is an example of the nonnormative kind
of education I think we ought to have. Graduate
education students at UCLA have to take "Statis-
tics for Teachers." Assume that this course is
necessary and that the objective is to make stu-
dents learn it better. The course is divided into 16
units, each one lasting a week. The first unit is
called "Counting." The students start out in large
groups, enter supplementary groups, and then join
small tutorial sessions, not quitting until they have
learned the week's work. They are tested as soon as
they learn the material: some after the first hour
and others after 15 hours. In the typical college
course, students take a midterm at the end of
about the fifth week. This is the first chance teach-
ers get to find out what their students are learning.
Those for whom the instruction hasn't worked are
five weeks behind before anybody gets the first
inkling. Tests come at the wrong time, but as long
as the system has enough people clinking out at the
end, we are not bothered about those who are
ground up inside. The "Statistics for Teachers"
course produces between 85 and 90 percent A's,
proving that anybody can learn almost anything
that we teach in school. I am not sure that we are
serving anybody, however, by reorganizing a sys-
tem that's already no good and out of phase with
what society needs.

The number one priority for every program in 
the School of Education at the University of 
Massachusetts is to combat institutional racism. 
Our admissions procedure produces a bimodal
population of students: the Ivy League-Phi Beta
Kappa type in one hump, and in the other, the
people who are just smart. We talk to prospective
students and admit the ones we like. When we get
through with them, no one can tell the two groups
apart anymore. This is very necessary if you are
eliminating a testing program, for you should never
patronize the folks at the bottom by giving them a
second-class education. We also think that writing
dissertations does not indicate leadership in educa-
tion and that not all doctoral candidates, whether
Ivy League or non-Ivy League types, should have
to write dissertations. My point is to find a dif-
ferent standard and apply it to all students.

One of the problems with testing is that it 
permits decisions only about individuals and never 
about institutions. I would like to use testing and 
evaluation to help me make decisions about our
program, but I need information about the
experiences that contribute to educational com-
petence and leadership. We offer 16 programs,
some self-contained with no core requirements,
some that are totally touchy-feely, and others that
are dice-them-up-and-competency-base-them. Each
can succeed on its own terms and is allowed to do
so. We must find out what is good for whom and
for what. Aptitude treatment interaction may help
us. We also have to begin to look at subcultural
differences in a different way to find a unity with-
in a diversity, to appreciate being different, to
make our educational system reflect this, and to
find standards and ways of checking up that are
free of the insidious by-products that the testing
program has given us. I want to test the people
who can learn from tests and prohibit testing the
people for whom learning is obstructed by tests.

This society will never succeed until we 
recognize that we are part of a multiracial, multi-
class, multisex world. We must develop strategies
to make people produce on their own good inten- 
tions. There must be a renovation of testing and 
curriculum and an end to the idea that teachers are 
neutral. So long as teachers have to pretend to be 
objective, so long as schools can teach only those 
things that are safe for everybody, education will 
be unreal. We are living in a complex world, and 
our school system has to become complex. 






REPORT OF THE NEA TASK FORCE ON TESTING, 1975






introduction



This report is submitted to the 1975 Repre-
sentative Assembly by the Task Force on Testing
in fulfillment of its responsibilities under New
Business Items 51 and 28, adopted by the 1972
Assembly, which stated:

The NEA shall establish a task force to deal with
numerous and complex problems communicated to it
under the general heading of testing. This task force
shall report its findings and proposals for further
action at the 1973 Representative Assembly. (Item
1972-51)

This Representative Assembly directs the National
Education Association to immediately call a national
moratorium on standardized testing and at the same
time set up a task force on standardized testing to
research and make its findings available to the 1975
Representative Assembly for further action. (Item
1972-28)



In the report of its findings to the 1973
Representative Assembly the Task Force set down
some well-founded beliefs which have drawn a
significant amount of attention from both inside
and outside the profession. Follow-up efforts
served to further verify and intensify the positions
taken. The Task Force feels, therefore, that the
most appropriate final report it can make to the
1975 Representative Assembly is a reinforcement
and redeclaration of those beliefs, with recom-
mendations to the Association (a) to make them
the basis for future NEA policy on testing issues
and (b) to continue seeking, through appropriate
program and other efforts, ways of countering
widespread misuse and abuses in educational and
psychological testing as they relate to teachers and
students, particularly those who are culturally and
linguistically different.

The Task Force also addresses itself here to 
matters which have added weight to its stated
beliefs: to important court actions, some of which
involve the united teaching profession; to the
valuable liaisons it has established with other
groups; to special writings developed for use by the
Task Force; to supportive literature; and to the
moratorium issue.

The Task Force is indebted to the many per-
sons who contributed their expertise to its total
three-year effort through personal or indirect
testimony, consultation, or other assistance, and to
those who have taken notice of its findings.

This report was approved by the NEA Task 
Force on Testing at its final meeting on April 5, 
1975, by unanimous vote of the members present. 

TASK FORCE POSITIONS AND 
CONSIDERATIONS 

As stated in its first interim report and as
strengthened in further deliberation on the issues,
the NEA Task Force on Testing believes:

1. That some measurement and evaluation in
education is necessary.

2. That some of the measurement and evaluation
tools developed over the years, and currently
in use, contain satisfactory validity and
reliability requirements and serve useful pur-
poses when properly administered and in-
terpreted.

3. That certain measurement and evaluation
tools are either invalid and unreliable, out of
date, or unfair and should be withdrawn from
use.

4. That the training of those who use measure-
ment and evaluation tools is woefully in-
adequate and that schools of education,
school systems, the education profession, and
the testing industry all must take responsi-
bility for correcting these inadequacies. Such
training must develop understanding about
the limitations of tests in predicting potential
learning ability, about their lack of validity in
measuring innate characteristics, and their
dehumanizing effects on many students. It
must also develop understanding of students'
rights related to testing and the use of test
results.

5. That there is overkill in the use of tests and
that the intended purposes of testing can be
accomplished through the use of individual
diagnostic instruments, through sampling
techniques which involve the use of tests, and
through a variety of alternatives to tests.

6. That the National Teacher Examinations are
an improper tool and must not be used for
teacher certification, recertification, selection,
assignment, retention, salary determination,
promotion, transfer, tenure, or dismissal.

7. That no test results should be used as a basis
for allocation of federal, state, or local funds.

8. That no tests should be used for tracking stu-
dents.

9. That while the purposes and procedures of
the National Assessment of Education may
have been initially sound, a number of state
adaptations of the program (in Michigan and
New Jersey, for example) have subverted the
original intent and as a result are harmful.

10. That both the content and the use of the
typical group intelligence test are biased
against those who are economically dis-
advantaged and culturally and linguistically
different. In fact, group intelligence tests are
potentially harmful to all students.

11. That the use of the typical intelligence test
contributes to what has come to be termed
"the self-fulfilling prophecy," whereby stu-
dents' achievement tends to fulfill the expecta-
tions held by others.

12. That test results are often used by educators,
students, and parents in ways that are damag-
ing to the self-concept of many students.

13. That the testing industry must demonstrate
significantly increased responsibility for
validity, reliability, and relevance of their
tests, for their fair application, and for ac-
curate and just interpretation and use of the
results.

14. That the public, and some in the profession,
misinterpret the results of tests as they relate
to status and needs of groups of students as
well as to individual students.

15. That the overemphasis in assessment programs
on testing recall-type, cognitive facts has tend-
ed to shift teaching emphasis to tasks which
are simple and easy to measure and has re-
sulted in serious inattention to the complex,
higher-level mental processes and to affective
skills and attitudes which are so difficult to
measure but which are equally and, in some
respects, more important.


In summary, the Task Force believes:

That the major use of tests should be to im-
prove instruction: to diagnose learning dif-
ficulties and to plan learning activities in
response to learning needs. Tests must not be
used in any way to label and classify students,
to track students into homogeneous groups,
to determine educational programs, to per-
petuate an elitism, or to maintain some
groups and individuals "in their place" near
the bottom of the socioeconomic ladder. In
short, tests must not be used in ways that will
deny any student full access to equal educa-
tional opportunity.

Some Special Considerations 

EFFECTS OF TESTS ON MINORITIES

Throughout its study the Task Force has been
especially impressed with the depth of feeling and
the weight of evidence against group standardized
tests as reliable/valid measures of achievement and
intelligence. Throughout its stated beliefs it has
alluded to the injurious and prejudicial aspects of
such tests. The term standardized implies homo-
geneity, stereotyping, and equalized development
and achievement, and is contradictory to the best
interests of a pluralistic society. The practice of
standardized testing has, in fact, deprived
minorities (the economically disadvantaged, cul-
turally and linguistically different, and women) of
access to equal educational opportunity.

Traditional IQ testing particularly has come
under increasingly heavy attack for falsely labeling
many minority children as "mentally retarded,"
based on what Jane Mercer has termed "Anglo-
centric measures." Such tests are touted as re-
liable/valid measures of the ability and achieve-
ment of varying populations even though the test-
takers' educational and cultural backgrounds, op-
portunities, and experiences may be markedly dif-
ferent from those on whom the tests are stan-
dardized.

Recently, Robert L. Green, educational
psychologist and dean of the College of Urban
Development at Michigan State University, called
intelligence testing "the awesome danger" and
pointed to the potential compounding of that
danger by continued use of traditional tests:

. . . experiences of black and other minority children
are not reflected in the content of the test. This bias
is even more apparent when the child's opportunities
have been limited due to poverty. Consequently,
many black children start test-taking with a good
chance of "flunking" an "experience" they have
never been exposed to. . . . When a child is labeled as
a "ne'er-do-well" in the early grades and is forced to
keep wearing that label, important educational oppor-
tunities are denied him. Sometimes he may never be
taught to read; he certainly will not be given access to
college preparation courses. Discrimination in educa-
tion means disadvantage in the job market. A low-
paying job means low-income status, so a test vic-
tim's children may become test victims themselves.

Widespread dissemination of test results
which can be easily misinterpreted, cases of in-
vasion of privacy, and proposals for educational
funding on the basis of test scores add further
evidence of the potential harmfulness of stan-
dardized testing.

The Task Force restates emphatically that
since currently used standardized tests in general
are developed and normed for students of Anglo-
American middle-class culture and economic
status, any use of the results of standardized test-
ing to place or track students, to denigrate mi-
nority intelligence, to discriminate against groups
or individuals, to restrict funding of programs, or
to misinform the public constitutes deplorable
practice and denies access to equal opportunity.

The Task Force calls for a humanistic ap- 
proach to student evaluation on the part of all 
those who have a role and responsibility in the 
process. In particular:

• The Task Force urges teachers

- to develop understanding of their stu-
dents' socioeconomic backgrounds and
sensitivity to their individual needs and
problems

- to refuse to administer tests which they
find to be biased

- to secure by appropriate means their
right to be involved in school and school
district decision making related to test-
ing

- to exert collective influence on the test-
ing industry and on state and local
school systems in order to secure from
them a firm commitment to evaluation
programs, the purpose of which is not to
compare students but to improve in-
struction.



• The Task Force urges spokespersons of all
cultures to continue exposing erroneous con-
tentions that some groups in society are
genetically less intelligent than others.

• The Task Force urges the testing industry to
take greatly increased responsibility for turn-
ing out fair and bias-free tests and for con-
stantly monitoring the distribution and appli-
cation of their products to ensure proper use.

• The Task Force urges education agencies at all 
levels to institute sampling procedures for all 
large-scale assessments, the results of which 
should be used for general information pur- 
poses only. 

COURT ACTIONS 

Years of controversy over testing practices have
also led to civil suits. The continued use of tests in
teacher licensure and hiring and continued use of
biased instruments with students who are dis-
advantaged and culturally and linguistically dif-
ferent increase the possibilities for legal action
against school systems and the almost unregulated
testing industry.

In 1971 the Supreme Court ruled in Griggs v.
Duke Power Co. that tests given to job applicants
had to be job-related. This case has been cited in
court decisions related to standardized testing of
teachers. It was referred to, for example, in the
1974 decision in favor of 13 Black teachers against
the school board of Nansemond County, Virginia,
as were arguments presented by the NEA in an
amicus curiae brief. The Fourth Circuit Court of
Appeals ruled unconstitutional a hiring require-
ment that teachers take the National Teacher
Examinations (NTE) and achieve a minimum score
on the common examination. The effect of the
requirement was to substantially diminish the
Black teaching force. The ruling overturned the
trial court's conclusion that the test had content
validity, noting that no evidence was presented
which established a relationship between questions
on the test and knowledge required for teaching,
and that it was arbitrary to apply a general knowl-
edge test to teachers of different subjects because
their jobs are substantially different.

The Nansemond case is likely to have positive
impact on pending litigation in North Carolina in
which the united teaching profession is involved.
The NEA and the state affiliate have intervened in
a Justice Department suit challenging the validity
of state requirements for minimum NTE scores for
certification purposes which affect both employ-
ment and placement on salary scales. In South
Carolina, the state education association has filed a
complaint under Title VII of the Civil Rights Act
of 1964 challenging the use of minimum NTE
scores for certification. A favorable decision in a
current Georgia suit could eliminate the NTE re-
quirement for advanced certification and its
potential restrictions on promotion and pay.

A precedential award in Association-
supported litigations, including two major cases
involving teacher test requirements, was announced
early in 1975. A federal court in Mississippi order-
ed two school districts in that state to pay
$106,000 in attorney fees, expenses, and court
costs in cases in which it was alleged and deter-
mined that racial discrimination had played a part
in employment decisions during a period of
desegregation. One of the cases was brought on
behalf of a group of Columbus teachers who were
fired for failing to achieve minimum scores on the
NTE; another case involved a group in Starkville
who challenged required scores on the Graduate
Record Examinations (GRE). Most of the teachers
had previously won the right to reinstatement with
back pay.

The NEA and the New Jersey Education
Association are challenging that state's assessment
program in order to prevent dissemination of stan-
dardized test scores which might violate civil and
constitutional rights of both teachers and students,
and cause racial and ethnic polarization by per-
mitting degrading stigmatization and illegal classifi-
cations. The complaint has so far resulted in action
by the State Board of Education to remove an
ambiguous section of the administrative code that
could have been interpreted to permit using test
results in conjunction with other data to support
disciplinary action against teachers.

Hobson v. Hansen (1967), in which the court
abolished the track system in the District of
Columbia public schools, was probably the land-
mark case tying standardized testing to denial of
equal educational opportunity, in this instance to
Black and economically disadvantaged students.
More recently, two cases still in the courts in
California are seeking to uphold the constitutional
rights of culturally and linguistically different
minority students by preventing the use of stan-
dardized IQ tests. The judge, the same in both
cases, has found that standardized IQ testing causes
a disproportionately high percentage of minority
students to be placed in classes for the educable
mentally retarded (EMR). In the case of Diana v.
the State Board of Education, involving Chicano
students, a stipulation was issued ordering local
boards to come up with a formula to reduce the
variance between the percentage of Chicano chil-
dren in EMR classes and the percentage in the gen-
eral school population; planning is still under way
between the local school systems and the State
Department of Education. The case of Larry P. v.
Riles, brought on behalf of Black students, led to
court-ordered stoppage of IQ testing of Black stu-
dents in the state. The economic factor inherent in
recent legislation making IQ testing in California
optional at school district expense may also have
tended to halt the practice with all students in
some places.



LIAISONS

The initiating of dialogues with other organi-
zations and agencies involved with test develop-
ment, use, and research must be considered an
important accomplishment of the Task Force. And
it would be in the best interest of practitioners for
the Association to continue the dialogues and to
establish cooperative working relationships toward
the goal of eliminating test misuse and abuse.

Standards Development Groups

The Task Force continues to be concerned
over the lack of direct teacher involvement in the
formulation of testing standards; for example, the
American Psychological Association's (APA) Stan-
dards for Educational and Psychological Testing.
These standards were developed by a joint com-
mittee of the APA, the American Educational Re-
search Association (AERA), and the National
Council on Measurement in Education (NCME).
The Task Force pursued its concern informally
with APA staff and followed up with a request to
the APA Board of Directors to approve the in-
clusion of an NEA representative on the joint com-
mittee, which is launching a project to develop
guidelines on evaluation of school programs. In a
letter to APA, the Task Force chairperson said that
"such representation should also explicitly provide
that the NEA representatives be involved on a con-
tinuing basis with the group and any other group
which may be constituted to give continuing direc-
tion to, substance and editorial advice on, and
make decisions about acceptance, publication, and
distribution of such guidelines." No action had
been taken on that request at the writing of this
report.



Testing Industry 

In March 1974 the Task Force formally ex- 
pressed its disappointment that the Educational 
Testing Service (ETS) had delayed enforcement of 
a cut-off in reporting NTE scores to South Carolina 
because they were being used for purposes of
teacher certification, which even ETS considers a 
misuse of the test. The Task Force notes here that 
the enforcement was later effected, and commends 
the ETS action (and reiterates that the united 
teaching profession is presently seeking to elimi- 
nate the South Carolina requirement). The Task 
Force also welcomes ETS's recent expression of 
interest in the Task Force beliefs and its initiation 
of a meeting early in 1975 with NEA and New 
Jersey staff representatives and the Task Force 
chairperson to discuss common concerns. 

Federal Government 

With project funding by the National Institute
of Education (NIE) in mind, the Task Force was
anxious to learn what is being done at the federal
level to encourage research which could have posi-
tive impact on the future of testing. NIE spokes-
persons conferred with the Task Force and re-
vealed that some projects already approved for
funding reflect some of the Task Force concerns.

In this instance, also, the Task Force has 
broached the subject of teacher representation in 
those decisions which will affect their practice. 
Though it is aware that the work of NIE is in the
public interest, the Task Force has registered its
concern that the public interest will not be well
served unless substantial numbers of teachers are
represented in NIE goal setting and suggested that
the Association be invited to appoint practitioners
to all NIE panels. Rather than direct involvement,
however, some NIE personnel seem to see NEA's
role as lobbyist in the legislative process of defining
parameters of NIE responsibility. Such indirect and
after-the-fact involvement will continue to be
unacceptable to teachers.

Another question put to the NIE spokesper- 
sons had to do with the Institute's interest in the 
establishment of a national center for certifying 
tests. The response was that at this time the extent
of such interest probably would be in exploring the
possibilities of such a center. This, of course, is the
focus of a current NEA staff study described 
below. 

The Task Force last year informed NEA
Government Relations of their concern over the
Quie amendment to the then pending H.R. 69 (re-
vision of the Elementary and Secondary Education
Act) which proposed to tie educational funding to
testing. It was pleased to learn that its concern was
relayed and may have been a factor in the with-
drawal of that amendment. The issue is now under
formal study by both NIE and the General Ac-
counting Office.



NEA FEASIBILITY STUDY FOR
TEST CERTIFICATION

Although the Task Force fulfills its official
responsibility with this report, it views a parallel
staff assignment as an extension of its work and
wants the general membership to be aware of it.
Staff in the Professional Excellence goal area are
currently conducting a study "to determine the
feasibility of a system whereby the NEA certifies
tests or other procedures for student or program
evaluation" (Subobjective 1.4). Three Task Force
members also serve on the nine-member advisory
committee which is engineering the study. At this
writing the committee has established some useful
contacts with the APA, the UCLA Center for the
Study of Evaluation, the National Council of
Teachers of English, and the National Association
of Elementary School Principals. It has drafted a
rationale for NEA certifying tests and the appro-
priate uses of tests, has outlined alternative
strategies, and plans a field survey for the final
quarter of FY 1974-75 to obtain reactions to the
proposed procedure.

SPECIAL PAPERS 

During its tenure the Task Force on Testing 
has initiated work on written statements to
support and elaborate on some of its expressed 
beliefs. Various drafts of these papers have already 
been cited and utilized in some quarters both in- 
side and outside the profession. All members of the 
Association should be aware of their existence. 
Three of the papers are in final form and are pub-
lished in this report. They are:

1. "Roles and Responsibilities of Groups Con-
cerned with Student Evaluation Systems."
This statement directs to specific groups
recommendations which the Task Force con-
siders essential for achieving the goals of
sound and fair development of tests, their ap-
propriate distribution and administration,
accurate and fair interpretation of results, and
relevant and constructive action based on the
results. The groups addressed are teachers and
their associations, other professional associa-
tions, students, minorities, the testing in-
dustry, school administrators, higher educa-
tion, and government agencies.

2. "Why Should All Those Students Take All
Those Tests?" This paper reflects the Task
Force's opinions on random and matrix
sampling as opposed to blanket testing. It
incorporates material developed by Dr. Frank
B. Womer of the Michigan School Testing Ser-
vice, University of Michigan, on determining
the use of sampling procedures.

3. "Guidelines and Cautions for Considering
Criterion-Referenced Testing." The concept
of criterion-referenced testing (also termed
objectives-referenced testing) has been pro-
moted as potentially more useful than norm-
referenced testing for measuring learning out-
comes for the purpose of improving instruc-
tion. This paper attempts to define the
criterion-referenced concept and to clear up
some of the confusion which surrounds it.
Fifteen caveats are listed and discussed. A
glossary of measurement terms is appended.
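
The contrast the second paper draws between matrix sampling and blanket testing can be made concrete with a short sketch. The Python below is an illustration only, not the procedure described in the Task Force paper: it draws a random sample of students, splits the item pool into several shorter forms, and gives each sampled student a single form. The function name `matrix_sample`, its parameters, and the example sizes are all hypothetical.

```python
import random


def matrix_sample(students, items, num_forms, sample_size, seed=0):
    """Assign each sampled student one short form drawn from the item pool.

    Hypothetical helper for illustration: `students` and `items` are lists,
    `num_forms` is how many short forms the pool is split into, and
    `sample_size` is how many students are tested at all.
    """
    rng = random.Random(seed)
    # Random sampling: only some students are tested.
    sampled = rng.sample(students, sample_size)
    # Matrix sampling: shuffle the pool and deal items into forms,
    # so no student answers every item.
    shuffled = list(items)
    rng.shuffle(shuffled)
    forms = [shuffled[i::num_forms] for i in range(num_forms)]
    # Rotate the forms across the sampled students.
    return {student: forms[k % num_forms] for k, student in enumerate(sampled)}


if __name__ == "__main__":
    students = [f"student_{i}" for i in range(100)]
    items = list(range(1, 23))  # a 22-item pool
    plan = matrix_sample(students, items, num_forms=4, sample_size=20)
    for student, form in plan.items():
        print(student, sorted(form))
```

Under blanket testing every student would answer every item; in this sketch one student in five answers five or six items, yet every item in the pool is still answered by several students, so group-level results remain available.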

Two other important statements which have 
been outlined require expertise that is beyond the 
time and capabilities of the Task Force in order to 
give them the highest credibility. These have been 
incorporated into and will be completed as
products in the goal area, Professional Excellence.

1. "Some Potential Alternatives to Standardized 
Tests for Evaluating Student Progress and 
Diagnosing Learning Needs." Alternatives in- 
clude criterion- or objectives-referenced tests, 
oral presentations by students, individual 
diagnostic tests, group diagnostic tests, teach- 
er-made tests, student self- and peer-evalua- 
tion, open admissions, school letter grades, 
subjective evaluation by teachers, contracts 
with students, interviews, parent-teacher
conferences, student narratives, student products,
and actual student performance. This collec- 
tion has promise as a handbook for teachers. 

2. A unit or module for preservice and in-service
teacher education pertaining to testing has
thus far been outlined in two forms: schema
and guidelines. This project stems from concern
over the present inadequacy of training
as expressed in the Task Force's belief No. 4
(see p. 83).

The Task Force sees all of the above as having
potential, collectively, as an NEA "awareness kit"
on testing issues.

SUPPORTIVE LITERATURE 

The Task Force was impressed with much of
the vast amount of literature that has been
published on testing and its effects, and considers it
appropriate to cite here a few recent items which
influenced the formulation of Task Force beliefs or
which support some of them. (Citations of other
important resources will be found in previous Task
Force reports.)






REFERENCES



Blachford, Jean S. "A Teacher Views Criterion-Referenced
Tests." Today's Education 64: 36; March-April 1975.
Points teachers must consider as they "become part of
the national movement toward criterion-referenced
tests," and a plea for proper in-service education.



De Avila, Edward A., and Havassy, Barbara. "The Testing
of Minority Children: A Neo-Piagetian Approach."
Today's Education 63: 72-75; November-December
1974. A challenge to industry attempts at restructuring
present tests to produce bias-free instruments, and
descriptions of an alternative assessment model and a
computerized system for use of test data both for
general information and to individualize instruction.



Gartner, Alan; Greer, Colin; and Riessman, Frank, editors.
The New Assault on Equality: IQ and Social
Stratification. New York: Perennial Library (paperback),
Harper & Row, 1974. 225 pp. Nine experts examine
the past and present of the IQ controversy and draw
some important conclusions about the role of IQ in
society.



Goslin, David A. Teachers and Testing. New York: Russell
Sage Foundation, 1967. 201 pp. An exploratory study
of the uses of standardized tests in schools, teachers'
experience with tests and testing, their attitudes and
roles.



Green, Donald Ross. Racial and Ethnic Bias in Test
Construction. Monterey, Calif.: McGraw-Hill, n.d.
Adapted from a federally funded study of the same
title. The researcher found the need for changes in test
construction procedures to produce unbiased
instruments and suggests that research should be a standard
part of producing a test.



Holmen, Milton G., and Docter, Richard. Educational and
Psychological Testing. New York: Russell Sage
Foundation, 1972. 218 pp. An evaluative study of the testing
industry, its products, and how they are used, with
action recommendations for "those who influence the
gatekeepers in our society."

Mercer, Jane R. Labeling the Mentally Retarded. Berkeley:
University of California Press, 1973. Federally
sponsored study of "Clinical and Social System
Perspectives on Mental Retardation" in an American
community. In a popularized description of the study (see
"IQ: The Lethal Label," in Psychology Today 6:
44-47, 95-97; September 1972), Mercer says that
"schools seem to have the primary responsibility for
identifying the mentally retarded" via the IQ test,
which she concludes is inaccurate and unfair.

National Education Association. Evaluation and Reporting
of Student Achievement. What Research Says to the
Teacher series. Washington, D.C.: the Association,
1974. 32 pp. Review of selected research and
literature on (a) purposes of evaluation and reporting, (b)
their development in relation to different educational
philosophies and teaching methods, (c) the best way
to report achievement, and (d) evaluation to improve
instruction.

Stiggins, Richard J. "An Alternative to Blanket
Standardized Testing." Today's Education 64: 38-40;
March-April 1975. An explanation of and argument
for depending on random and matrix sampling in
educational testing.

Weber, George. Uses and Abuses of Standardized Testing in
the Schools. Occasional Papers, No. 22. Washington,
D.C.: Council for Basic Education, 1974. 38 pp. Brief,
clearly written critique of intelligence, aptitude, and
achievement tests; their uses, limitations, and abuses;
and discussion of current controversies surrounding
standardized testing.






RECOMMENDATIONS 



The Task Force recommends that: 

1. The Association incorporate the principles
inherent in the stated beliefs of the Task
Force on Testing in any and all future official
NEA policy on testing of students and
teachers and the uses of tests and their results.

2. The Association continue the liaisons estab- 
lished by the Task Force with: 

a. The Joint Committee on Standards
Development of the American Psychological
Association, the American Educational
Research Association, and the
National Council on Measurement in
Education.

b. The National Institute of Education. 

c. The Educational Testing Service. (The
Task Force also recommends that the
Association establish similar relationships
with other members of the testing
industry.)

3. The Association develop a strategy for
establishing with other groups and organizations
formal alliances for the purpose of
combating deleterious testing practices. These
might include the National Association for
the Advancement of Colored People, the
Association of Black Psychologists, the
Mexican-American Legal Defense and
Education Fund, the National Urban League, the
Civil Rights Commission, parent groups, and
other educational organizations, e.g.,
Association for Supervision and Curriculum
Development.



4. The Executive Committee approve the papers
entitled "Roles and Responsibilities of
Groups Concerned with Student Evaluation
Systems," "Why Should All Those Students
Take All Those Tests?" and "Guidelines and
Cautions for Considering Criterion-Referenced
Testing," and that the Association
publish them as an information package for
distribution to the leadership network and for
general availability. It is further recommended
that the proposed handbook on "Alternatives
to Standardized Testing" and the proposed
module on testing for preservice/in-service
teacher education be made components of the
information package.

5. The Association complete a thorough
exploration of the feasibility of a system
whereby the NEA certifies tests or other
procedures for student or program evaluation.
Such exploration is currently under way as a
subobjective of the Professional Excellence
goal area.

6. The Association temporarily set aside the
moratorium on standardized testing as a
national objective (as called for in New
Business Item 28 adopted in 1972) in order to
concentrate its energies in this area on lending
support to affiliates as they implement
strategies to challenge standardized testing;
for example, initiating court actions on behalf
of students or teachers, attacking specific test
instruments, seeking alliances with other
groups which have a vested interest in
countering test abuse, cross-committee
planning for remediation of problems related
to testing, developing negotiation procedures
and language dealing with testing issues.





CONTRIBUTORS 






Dwight Allen, Dean, School of Education, University of
Massachusetts, Amherst.

Jean S. Blachford, Classroom Teacher, New Brunswick,
New Jersey; Member, NEA Task Force on Testing.

William F. Brazziel, Professor of Higher Education,
University of Connecticut, Storrs.

Jose A. Cardenas, Superintendent, Edgewood Independent
School District, San Antonio, Texas.

Lupe Castillo, Classroom Teacher, San Francisco,
California; Member, NEA Task Force on Testing.



Dorothy Lee Collins, Classroom Teacher (Counselor), San
Antonio, Texas; Member, NEA Task Force on Testing.

Charlotte Darehshori, Teacher, Primary Grouping, William
Penn Elementary School, Bakersfield, California.

Edward A. De Avila, Director of Educational Planning and
Research, Bilingual Children's Television.

Richard F. Docter, Professor of Psychology, California
State University, Northridge.



Brenda S. Engel, Visiting Assistant Professor, Lesley
College, Cambridge, Massachusetts; Former Public School
Art Teacher.

Barbara Havassy, Langley Porter Neuropsychiatric Institute,
University of California, San Francisco.

Milton G. Holmen, Professor of Management and Associate
Dean, School of Business Administration, University
of Southern California, Los Angeles.

Pilialoha Lee Loy, Classroom Teacher, Honolulu, Hawaii;
Member, NEA Task Force on Testing.



Bernard McKenna, NEA, Instruction and Professional
Development.

Jane R. Mercer, Associate Professor of Sociology,
University of California, Riverside.

Lawrence Perales, Classroom Teacher, Santa Maria,
California; Member, NEA Task Force on Testing.

Frances Quinto, NEA, Instruction and Professional
Development.

Charles J. Sanders, Classroom Teacher (Secondary
Counselor), Millinocket, Maine; Chairperson, NEA Task
Force on Testing.

Robert S. Soar, Foundations of Education, Institute for
Development of Human Resources, University of
Florida.

Ruth M. Soar, Florida Educational Research and
Development Council.

Thelma Spencer, Director, Teacher Education Examination
Program, Educational Testing Service, Princeton, New
Jersey.

Richard J. Stiggins, Assistant Director of Test
Development, American College Testing Program; Former
Coordinator of Educational Research and Program
Evaluation, Edina (Minnesota) Public Schools.

Edwin F. Taylor, Senior Research Scientist, Department of
Physics and Division for Study and Research in
Education, Massachusetts Institute of Technology.

Robert L. Williams, Director, Black Studies Program, and
Professor of Psychology, Washington University, St.
Louis, Missouri.

Leroy Wilson, Classroom Teacher, Ocala, Florida; Member,
NEA Task Force on Testing.






FOOTNOTES AND REFERENCES



"What's Wrong with Standardized Testing?" by Bernard 
McKenna 

1. For sources and further explanation of Alfred Binet's
work and Lewis Terman and Henry Goddard's work, see
Kamin, Leon J. "The Politics of I.Q." National Elementary
Principal 54: 15-22; March-April 1975.



"Why Should All Those Students Take All Those Tests?" 

1. In Task Force and Other Reports presented to the
Fifty-Second Representative Assembly of the National
Education Association, July 3-6, 1973, Portland, Oregon, pp.
26-46.

2. House, Ernest R.; Rivers, Wendell; and Stufflebeam,
Dan. An Assessment of the Michigan Accountability
System. Michigan Education Association and National
Education Association, March 1974. pp. 14-16.

3. National Education Association. "Criteria for
Evaluating State Education Accountability Systems."
Washington, D.C.: the Association, n.d.

4. Womer, Frank B. Developing a Large-Scale Assessment
Program. Denver: Cooperative Accountability Project,
1973.

5. Psychometrics in the strictest sense of the definition
has to do with the measurement of mental abilities. It has
come to be used much more broadly to define a wide range
of activities in assessment and evaluation.

6. For information on probability samples, see Womer,
op. cit.



"Guidelines and Cautions for Considering Criterion-Referenced
Testing" by Bernard McKenna

1. Baker, Eva L. "Beyond Objectives: Domain-Referenced
Tests for Evaluation and Instructional Improvement."
Educational Technology, 1973.

2. Bloom, Benjamin S.; Hastings, J. Thomas; and
Madaus, George F. Chapter 11: "The Cooperative
Development of Evaluation Systems." Handbook on Formative and
Summative Evaluation of Student Learning. New York:
McGraw-Hill, 1971. pp. 249-58.

3. Glaser, Robert, and Nitko, Anthony J. "Measurement
in Learning and Instruction." Educational Measurement.
(Edited by Robert L. Thorndike.) Washington, D.C.:
American Council on Education, 1971. pp. 625-70.

4. Hively, Wells. "Domain-Referenced Testing."
Educational Technology, 1973.



5. House, Ernest R. "Validating a Goal-Priority
Instrument." Paper presented at the annual meeting of American
Educational Research Association, New Orleans, February
25-March 1, 1973.

6. ______; Rivers, Wendell; and Stufflebeam, Dan. An
Assessment of the Michigan Accountability System.
Michigan Education Association and National Education
Association, March 1974.

7. Klein, Stephen P., and Kosecoff, Jacqueline. Issues
and Procedures in the Development of Criterion-Referenced
Tests. Princeton, N.J.: ERIC Clearinghouse on Tests,
Measurement, and Evaluation, September 1973.

8. Millman, Jason. "How To Make Assessment Plans for
Domain-Referenced Tests." Educational Technology, 1973.

9. Popham, W. James, and Husek, T. R. "Implications
of Criterion-Referenced Measurement." Journal of
Educational Measurement, 1969.



10. Stake, Robert E. "Measuring What Learners Learn."
School Evaluation. (Edited by Ernest R. House.) Berkeley,
Calif.: McCutchan Publishing Corp., 1973.

11. ______, and Gooler, Dennis. "Measuring Goal
Priorities." School Evaluation. (Edited by Ernest R.
House.) Berkeley, Calif.: McCutchan Publishing Corp.,
1973.

12. Womer, Frank B. "What is Criterion-Referenced
Measurement?" IRA Committee on the Evaluation of
Reading Tests, 1973.



"Criticisms of Standardized Testing" by Milton G. Holmen
and Richard F. Docter

1. Passanella, Ann K.; Manning, Winton H.; and
Findikyan, Nurhan. "Criticisms of Testing: I." Unpublished
report to Commission on Tests, College Entrance
Examination Board, 1967. ED 039 395.

2. Goslin, David A. "What's Wrong with Tests and
Testing." College Board Review 65: 12-18; Fall 1967; College
Board Review 66: 33-37; Winter 1967. ED 039 392.

3. Holmen, Milton G., and Docter, Richard F.
Educational and Psychological Testing: A Study of the Industry
and Its Practices. New York: Russell Sage Foundation,
1972.

4. Rosenthal, Robert, and Jacobson, Lenore. Pygmalion
in the Classroom: Teacher Expectation and Pupils'
Intellectual Development. New York: Holt, Rinehart and
Winston, 1968.



"Problems in Using Pupil Outcomes for Teacher Evaluation"
by Robert S. Soar and Ruth M. Soar

1. Anderson, G. J. "Effects of Classroom Social Climate
on Individual Learning." American Educational Research
Journal 7: 135-53; March 1970.

2. Bereiter, C. "Some Persisting Dilemmas in the
Measurement of Change." Problems in Measuring Change.
(Edited by C. W. Harris.) Madison: University of Wisconsin
Press, 1963. p. X

3. Brophy, J. E. Stability in Teacher Effectiveness. R &
D Report Series 77. Austin: Research and Development
Center for Teacher Education, University of Texas, July
1972.

4. Cronbach, L. J. Essentials of Psychological Testing.
Second edition. New York: Harper and Brothers, 1960. p.
131.

5. Flanders, N. A. "The Changing Base of
Performance-Based Teaching." Phi Delta Kappan 55: 312-13; January
1974.

6. Garber, M., and Ware, W. B. "The Home
Environment as a Predictor of School Achievement." Theory Into
Practice 11: 190-95; June 1972.

7. Lord, F. M. "Elementary Models for Measuring
Change." Problems in Measuring Change. (Edited by C. W.
Harris.) Madison: University of Wisconsin Press, 1963.
Chapter 2, pp. 21-38.

8. McDonald, F. J. "The State of the Art in
Performance Assessment of Teaching Competence."
Performance Education: Assessment. (Edited by T. E.
Andrews.) Albany: Multi-State Consortium on
Performance-Based Teacher Education, New York State
Education Department, 1974.

9. Mayeske, G. W., and others. A Study of Our Nation's
Schools. U.S. Department of Health, Education, and
Welfare, Office of Education. Report No. DHEW-OE-72-142.
Washington, D.C.: Government Printing Office,
1972.

10. Medley, D. M. "Research and Assessment in PBTE."
AACTE Leadership Training Conference on
Performance-Based Teacher Education, St. Louis, April 30, 1974.

11. Mosteller, F., and Moynihan, D. P. On Equality of
Educational Opportunity. New York: Random House,
1972.

12. Page, E. B. "A Final Footnote on PC and OEO."
Phi Delta Kappan 54: 575; April 1973.

13. ______. "How We All Failed at Performance
Contracting." Phi Delta Kappan 54: 115-17; October 1972.



14. Rosenshine, Barak. "The Stability of Teacher
Effects Upon Student Achievement." Review of
Educational Research 40: 647-62; December 1970.

15. Small, Alan A. "Accountability in Victorian
England." Phi Delta Kappan 53: 438-39; March 1972.

16. Soar, R. S. "Optimum Teacher-Pupil Interaction
for Pupil Growth." Educational Leadership Research
Supplement 2: 275-80; December 1968.

17. Soar, R. S., and Soar, R. M. Classroom Behavior,
Pupil Characteristics, and Pupil Growth for the School Year
and for the Summer. Grant numbers 5 R01 MH 15891 and
5 R01 MH 15626, National Institute of Mental Health, U.S.
Department of Health, Education, and Welfare. Gainesville:
University of Florida, 1973.

18. ______. "An Empirical Analysis of Selected Follow
Through Programs: An Example of a Process Approach
to Evaluation." Early Childhood Education. (Edited by I. J.
Gordon.) Seventy-first Yearbook, Part II, National Society
for the Study of Education. Chicago: University of Chicago
Press, 1972. Chapter 11, pp. 229-59.

19. Solomon, D.; Bezdek, W. E.; and Rosenberg, L.
Teaching Styles and Learning. Chicago: Center for the
Study of Liberal Education for Adults, 1963.



"Use of Tests: Employment and Counseling" by Thelma
Spencer

1. Black, Hillel. They Shall Not Pass. New York:
William Morrow & Co., 1963. p. 167.

2. Wasserman, Miriam. School Fix, NYC, U.S.A. New
York: Outerbridge and Dienstfrey, 1970. pp. 155-56.

3. Quoted in Black, op. cit., p. 260.



Report of the NEA Task Force on Testing, 1975

1. For supporting arguments, see the first interim report
in Task Force and Other Reports presented to the
Fifty-Second Representative Assembly of the National Education
Association, July 3-6, 1973, Portland, Oregon (pp. 26-46).

2. House, Ernest; Rivers, Wendell; and Stufflebeam,
Daniel. An Assessment of the Michigan Accountability
System. Michigan Education Association and National
Education Association, March 1974.

3. See the section on "Supportive Literature" for a
citation of Mercer's study.

4. "The Awesome Danger of Intelligence Tests." Ebony
29: 68-70, 72; August 1974.






5. Members of the committee are Jean Blachford,
Pilialoha Lee Loy, and Lawrence Perales representing the
Task Force on Testing; Norman Goldman, director of
instruction and professional development, New Jersey
Education Association; Margaret Morrison, guidance counselor,
Rockville, Maryland; Gene V. Glass of the Laboratory of
Educational Research, University of Colorado; and Bernard
Bartholomew, Bernard McKenna, and Frances Quinto, NEA
staff.






